Summary of all the vulnerabilities reported by Cisco Talos in Microsoft Azure Sphere
By Claudio Bozzato and Lilith [>_>].
In May 2020, Microsoft kicked off the Azure Sphere Security Research Challenge, a three-month initiative aimed at finding bugs in Azure Sphere. In the first three months, Cisco Talos reported 16 vulnerabilities. Our analysis continued intermittently, and eventually, we discovered and reported a total of 31 published vulnerabilities, two of which were present in the Linux kernel itself.
We have already released several blog posts about Azure Sphere (see blog posts 1, 2, 3, 4, 5). Today, we’re putting a bow on our research by summarizing the vulnerabilities we found, how attackers could exploit them, and what that would mean for the user. We also have another blog post coming next week that will detail how we exploited a chain of two vulnerabilities to gain arbitrary kernel code execution.
The Azure Sphere platform is a cloud-connected and custom SoC platform designed specifically for internet-of-things (IoT) application security. Internally, the SoC is made up of several ARM cores that have different roles (e.g., running different types of applications, enforcing security, and managing encryption). The most relevant components to understand our analysis are:
- Normal World: An OS running the Linux kernel. There is no real userland environment (no shell, utilities, etc.); the init binary is application-manager, a custom application that takes care of spawning a few services. This runs on the Cortex A7 core. Userland can talk to Secmon and Pluton by issuing ioctls to /dev/security-monitor and /dev/pluton, respectively, while the kernel side can talk to Secmon via SMC (the ARM instruction used to call into TrustZone), and to Pluton via mailboxes (see the Linux mailbox framework, used for interprocessor communication).
- Security monitor aka Secmon (Secure world): The ARM TrustZone implementation for Azure Sphere. It communicates with the Normal World via SMC. This runs on the Cortex A7 core, as well.
- Pluton: An M4 core that runs security-sensitive operations. It communicates with the rest of the cores via mailboxes.
- Real-time cores: Two M4 cores at the complete disposal of the developers. These normally communicate with the rest of the cores via mailboxes.
Note that each of the cores above can set up DMA buffers to share info with any other core.
The most critical functionality lives in Secmon and Pluton, which might be seen as a single “high-privileged” entity. In reality, Pluton sits at a higher privilege level, as it takes care of more sensitive operations and even runs on a separate core. As an example, these are the sequential steps (greatly simplified) that take place when flashing an application:
- The Azured process in the Normal World handles the image to be flashed (received via sideload or the cloud), and performs an ioctl on /dev/security-monitor for flashing the image.
- The Linux kernel sets up a DMA buffer, stores the image in it, and calls a Secmon syscall via SMC to flash it.
- Secmon reads the image and sends a mailbox message to Pluton to verify it.
- Pluton receives the message, and returns the verification result to Secmon.
- If verification is successful, Secmon writes the image to flash.
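The flashing steps above can be sketched as a toy model. Everything in it is hypothetical: the class and function names, the HMAC tag standing in for real image signature verification, and the in-memory list standing in for flash are illustrative assumptions, not Azure Sphere internals. The only point is to show where the verification decision sits in the flow.

```python
import hashlib
import hmac

# Hypothetical stand-in for signing material. Real images carry asymmetric
# signatures; the HMAC here only approximates "signed vs. not signed".
SIGNING_KEY = b"demo-key-not-real"
TAG_LEN = 32


def sign_image(payload: bytes) -> bytes:
    """Produce a toy signed image: payload followed by a 32-byte tag."""
    return payload + hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()


class Pluton:
    """Models the M4 core that verifies images on Secmon's behalf."""

    def handle_mailbox(self, image: bytes) -> bool:
        payload, tag = image[:-TAG_LEN], image[-TAG_LEN:]
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
        return hmac.compare_digest(tag, expected)


class Secmon:
    """Models the Security Monitor: asks Pluton to verify, then writes flash."""

    def __init__(self, pluton: Pluton) -> None:
        self.pluton = pluton
        self.flash = []  # stand-in for the flash contents

    def smc_flash(self, dma_buffer: bytes) -> bool:
        if not self.pluton.handle_mailbox(dma_buffer):  # mailbox round trip
            return False
        self.flash.append(dma_buffer)  # only verified images reach flash
        return True


class LinuxKernel:
    """Models the Normal World kernel: fills a DMA buffer, then issues the SMC."""

    def __init__(self, secmon: Secmon) -> None:
        self.secmon = secmon

    def ioctl_flash(self, image: bytes) -> bool:
        dma_buffer = bytes(image)  # stand-in for setting up a DMA buffer
        return self.secmon.smc_flash(dma_buffer)


def azured_flash(kernel: LinuxKernel, image: bytes) -> bool:
    """Models azured issuing the ioctl on /dev/security-monitor."""
    return kernel.ioctl_flash(image)
```

In this model, `azured_flash(kernel, sign_image(b"app"))` succeeds, while an image with a bad tag is rejected before anything is written to `secmon.flash`.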
Here’s a simplified logical chart of the system:
Externally, the Azure Sphere platform is supported by Microsoft’s Azure Cloud, which handles secure updates, app deployment, and periodic verification of device integrity to determine if Azure Cloud access should be allowed or not. While Azure Sphere is updated and deployed through the Azure Cloud, customers can still interact with their own servers independently.
Customers push signed applications (about 600KB maximum) to their devices grouped in an Azure Sphere Cloud Tenant (or sideload them in development mode). These applications are granted extremely limited permissions by default. To use even such basic features as connecting to an IP address or hostname, storing any data to disk, or delaying software updates, an application must declare these needs in its application manifest (app_manifest.json, eventually stored in the image package that gets flashed). In practice, these declarations cause the user ID (UID) of the application (which is different on every installation) to be granted the specific Linux group IDs (GIDs) and/or Linux capabilities needed to interact with the requested feature.
All system applications (networkd, azcore, azured, etc.) have specific Linux and/or Azure Sphere capabilities to limit their access to only what they need. These Azure Sphere capabilities, stored and treated differently than normal Linux capabilities, limit access to critical Azure Sphere-specific interfaces and for the most part are used to limit access specifically to the ioctls of /dev/pluton and /dev/security-monitor.
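To make the deny-by-default idea concrete, here is a minimal sketch of how a manifest's declared capabilities could translate into group memberships. The capability names follow Microsoft's public app manifest format and should be checked against the current documentation. The group names are invented for illustration, since the real GIDs and capability assignments are internal to Azure Sphere.

```python
import json

# Hypothetical mapping from manifest capabilities to Linux group names.
# These group names are made up; the real assignments are not public here.
HYPOTHETICAL_GROUPS = {
    "AllowedConnections": "net-client",
    "MutableStorage": "mutable-storage",
    "SoftwareUpdateDeferral": "update-deferral",
    "PowerControls": "power-control",
}

# Example manifest with placeholder values.
MANIFEST_JSON = """
{
  "SchemaVersion": 1,
  "Name": "ExampleApp",
  "ComponentId": "00000000-0000-0000-0000-000000000000",
  "EntryPoint": "/bin/app",
  "Capabilities": {
    "AllowedConnections": ["example.com"],
    "MutableStorage": {"SizeKB": 8},
    "SoftwareUpdateDeferral": false
  }
}
"""


def groups_for_manifest(manifest: dict) -> set:
    """Return the hypothetical groups an app would be granted.

    A capability counts as requested only if it is present and truthy
    (non-empty list, non-empty object, or true); everything else stays
    denied by default.
    """
    requested = manifest.get("Capabilities", {})
    return {
        group
        for capability, group in HYPOTHETICAL_GROUPS.items()
        if requested.get(capability)
    }


manifest = json.loads(MANIFEST_JSON)
granted = groups_for_manifest(manifest)
print(sorted(granted))  # prints ['mutable-storage', 'net-client']
```

An app whose manifest declares no capabilities ends up with an empty set in this model, which mirrors the "extremely limited permissions by default" described above.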
Normal world userland
There aren’t any shell tools in the device (no sh, cat, ps, etc.), so to easily analyze the userland side, we cross-compiled busybox and installed it as an app, tunneling the I/O over a network socket.
Let’s briefly show how scarce the userland side is.
The application can only see its own processes:
The application’s directory structure is just a manifest and our app binary that launches the busybox binary that we cross-compiled:
View of the filesystem starting from /:
Azure Sphere’s security model
Microsoft details the “seven properties of highly secured devices” that Azure Sphere implements. While this doesn’t clearly describe the security model of the device, this is our understanding of the security guarantees that Azure Sphere provides in production mode:
- Separated components by security privilege (from bottom to top, lower levels have lower privileges): customer apps, Normal World kernel, Security Monitor, Pluton, 1BL (“first” boot loader).
- Compartmentalized security at the hardware level thanks to a hardware firewall: A fully compromised core will still have limited ability to talk to other cores, depending on firewall rules.
- Any code running on the device (be it customer’s or Microsoft’s code) must be signed, either by Microsoft or the customer.
- Microsoft acknowledges that via ROP (and similar techniques), it’s still possible to run arbitrary logic in the device. However, their security guarantee is that it’s not possible to write to memory or flash and execute unsigned code from it.
- The device’s identity can’t be spoofed (there’s a per-device private key stored in hardware and crypto operations are performed by a hardware crypto processor).
- Remote attestation ensures that the code running on the device has not been tampered with.
- Automatic security updates.
- The device can’t be downgraded to an older (vulnerable) version.
A customer can deploy signed applications to their Azure Sphere devices. This was the starting point for our analysis.
All of the vulnerabilities we discovered assume that a customer application (on the Normal World side) has been compromised and code execution gained (via ROP).
From this point, there can be many avenues for escalation:
- Execute unsigned code that isn’t pure return-oriented programming (ROP) under Linux.
- Elevate privileges outside of the capabilities described in the application manifest (e.g. changing user ID, adding access to a binary).
- Execute code on a system application.
- Execute code on the Linux kernel.
- Execute code on Secmon.
- Execute code on Pluton.
The vulnerabilities we’ve found, in one way or another, cover all the points above. We’ll now go through all of them in groups based on various escalation stages.
The most critical chain we achieved followed these steps:
- ROP capability in a vulnerable customer app — assume we start from here.
- Run arbitrary code in the app via an unsigned code execution issue.
- Execute arbitrary code in the Linux kernel via information leakage and memory corruption issues.
- Either perform an arbitrary read of Secmon memory via an information leak, or perform a firmware downgrade to flash a vulnerable version of any firmware (Secmon, Pluton, 1BL) onto the device.
Unsigned code execution in Normal World (Linux)
According to the Azure Sphere security model, all code running on the Azure Sphere device must be signed, either by Microsoft or the application developer. Practically speaking, this means all executable data on the device is located only within the ASXipFS partitions that make up all Azure Sphere applications (and the rootfs). Since the ASXipFS filesystem kernel driver does not support any sort of writes, and since the Littlefs filesystem (used for storing non-volatile data) is mounted as noexec, Azure Sphere extends the concept of W^X from process memory to the disk itself, creating a de facto W^X protection for the entire device: protection against “unsigned code execution.”
On the runtime side, to keep an attacker from changing page permissions and injecting code into process memory, protections for the mprotect and mmap syscalls were implemented in a custom Azure Sphere Linux Security Module (LSM). Using the VM_MAY* flags of virtual memory pages, Azure Sphere prohibits the changing of a given page’s permissions to executable if it had ever been writable in the past (it also prohibits a writable and executable page from ever being mapped in the first place).
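The policy described above can be modeled in a few lines. This is a simplified sketch of the behavior as described, not the LSM's actual code: the `Mapping` class, the `may_exec` field (standing in for the VM_MAYEXEC flag), and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4


class PolicyError(PermissionError):
    """Raised when the modeled policy rejects a request."""


@dataclass
class Mapping:
    prot: int
    may_exec: bool  # models VM_MAYEXEC; cleared once the mapping is writable


def model_mmap(prot: int) -> Mapping:
    """Create a mapping, refusing writable+executable requests outright."""
    if prot & PROT_WRITE and prot & PROT_EXEC:
        raise PolicyError("writable and executable mapping refused")
    return Mapping(prot=prot, may_exec=not (prot & PROT_WRITE))


def model_mprotect(mapping: Mapping, prot: int) -> None:
    """Change permissions, refusing exec on anything that was ever writable."""
    if prot & PROT_WRITE and prot & PROT_EXEC:
        raise PolicyError("writable and executable mapping refused")
    if prot & PROT_EXEC and not mapping.may_exec:
        raise PolicyError("mapping was writable in the past")
    if prot & PROT_WRITE:
        mapping.may_exec = False  # remember that this mapping was writable
    mapping.prot = prot


def is_allowed(action, *args) -> bool:
    """Return True if the modeled policy accepts the request."""
    try:
        action(*args)
    except PolicyError:
        return False
    return True
```

In this model, a mapping created as read+write can never become executable, and a code mapping that is made writable even once loses the ability to become executable again.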
Thus, being able to write to and then execute memory inside a given process is actually a non-trivial endeavor. We found seven ways to do this in Linux (where unspecified, no special preconditions are required):
- TALOS-2020-1093: Write to /proc/self/mem to modify the .text section.
- TALOS-2020-1138: Write to /proc/thread-self/mem to modify the .text section. This was found after TALOS-2020-1093 had been partially fixed.
- TALOS-2020-1090: Use ptrace to attach to the process and use ptrace's POKETEXT to modify the .text section.
- TALOS-2021-1249: Use ptrace as above, but in a namespace created with unshare or clone. This is a different vulnerability because it bypasses a different check in the kernel.
- TALOS-2021-1247: Create a new mount namespace, mount a tmpfs, write an executable in the tmpfs directory and execute it.
- TALOS-2020-1128: Set the READ_IMPLIES_EXEC personality, so that all mmap calls requesting read permissions also have exec permissions. This creates an rwx map that we can write to and execute from.
- TALOS-2020-1134: Use the PACKET_MMAP feature to create an AF_PACKET socket (requires CAP_NET_RAW) and mmap it as read+exec. Populate this buffer with shellcode by using another process that writes to the network interface, then execute from the buffer.
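As an illustration of the READ_IMPLIES_EXEC case (TALOS-2020-1128), the sketch below models how the personality flag widens a mapping request. It is a model only: `effective_prot` is a hypothetical helper, not a kernel function, and it makes no system calls. The constant values are the ones defined in the Linux headers. That a policy check might look only at the requested bits is an assumption made here for illustration. The advisory has the actual root cause.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
READ_IMPLIES_EXEC = 0x0400000  # value from <linux/personality.h>


def effective_prot(requested_prot: int, personality: int) -> int:
    """Model how a read request gains exec under READ_IMPLIES_EXEC."""
    if requested_prot & PROT_READ and personality & READ_IMPLIES_EXEC:
        return requested_prot | PROT_EXEC
    return requested_prot


def requests_wx(prot: int) -> bool:
    """True if the given protection bits are both writable and executable."""
    return bool(prot & PROT_WRITE) and bool(prot & PROT_EXEC)


requested = PROT_READ | PROT_WRITE
actual = effective_prot(requested, READ_IMPLIES_EXEC)

# The request itself does not look like W+X, but the resulting mapping is.
print(requests_wx(requested), requests_wx(actual))  # prints False True
```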
Normal World (Linux) vulnerabilities
The denial-of-service vulnerability below prevents the device from booting (bootloop), requiring manual recovery.
- TALOS-2020-1129: Use the truncate syscall to bypass the littlefs quota, which corrupts the /mnt/config partition, thus preventing the device from booting.
These are all kernel information disclosures, except the first one.
- TALOS-2020-1089: Use klogctl to read the kernel ring buffer and leak the ASLR offset of system applications.
- TALOS-2020-1130: Use the truncate syscall to leak kernel memory (not arbitrary), exploiting littlefs' lack of memory pages cleanup upon reuse.
- TALOS-2021-1339: Use the GPIO_SET_PIN_CONFIG_IOCTL ioctl against /dev/gpiochip0 for a partially-controlled kernel memory leak.
- TALOS-2021-1340: Use the GPIO_GET_PIN_ACCESS_CONTROL_USER ioctl against /dev/gpiochip0 to leak arbitrary kernel memory.
These are information disclosures in the upstream Linux kernel that we discovered while analyzing Azure Sphere. They apply to any 32-bit ARM system:
- TALOS-2020-1211: Read /proc/pid/syscall to leak kernel memory (not arbitrary).
- TALOS-2021-1243: Read the SIGPAGE memory map in any process’ address space to leak kernel memory (not arbitrary).
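For readers unfamiliar with /proc/pid/syscall, the sketch below parses the line format documented in proc(5): a syscall number, the argument registers, then the stack pointer and program counter. The sample line uses made-up values, and `parse_proc_syscall` is a hypothetical helper. This only shows which fields a reader of the file sees. It does not reproduce the leak itself.

```python
from typing import NamedTuple, Optional, Tuple


class SyscallState(NamedTuple):
    number: Optional[int]   # None when the process is running
    args: Tuple[int, ...]   # argument registers, empty if not in a syscall
    sp: Optional[int]
    pc: Optional[int]


def parse_proc_syscall(line: str) -> SyscallState:
    """Parse one line in the /proc/<pid>/syscall format described in proc(5)."""
    fields = line.split()
    if fields == ["running"]:
        return SyscallState(None, (), None, None)
    number = int(fields[0])  # decimal, -1 if blocked outside a syscall
    values = [int(field, 16) for field in fields[1:]]
    *args, sp, pc = values   # the last two values are always sp and pc
    return SyscallState(number, tuple(args), sp, pc)


# Made-up sample: syscall 0 with six argument registers, then sp and pc.
SAMPLE = "0 0x3 0x7f0000001000 0x1000 0x0 0x0 0x0 0x7ffc00000000 0x7f0000002000"
state = parse_proc_syscall(SAMPLE)
print(state.number, len(state.args), hex(state.sp))
```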
These are memory corruption vulnerabilities that lead to arbitrary kernel code execution:
- TALOS-2020-1118: Bind an AF_AZSPIO socket twice to cause a double-free in a kernel linked list.
- TALOS-2021-1250: Use namespaces to exploit a missing initialization in mqueue inodes in the kernel.
- TALOS-2021-1262: Use the PWM_APPLY_STATE ioctl against /dev/pwm0 to execute kfree on an arbitrary kernel address. Lilith Wyatt will be discussing how to exploit this in detail at Hitcon 2021 on Nov. 26; be sure to check out her talk.
Privilege escalation chain
This is a miscellaneous set of vulnerabilities that, when chained in the following order, allowed us to call arbitrary Secmon and Pluton syscalls.
- TALOS-2020-1131: Use a crafted image package to access arbitrary devices via ASXipFS inode. Such an image can be flashed via a signature handling issue found by McAfee ATR.
- TALOS-2020-1132: Use the MEMWRITE ioctl on /dev/mtd1 to write the config partition and modify the uid_map file, which maps processes to UIDs.
- TALOS-2020-1137: Duplicate the entry for the azured process in the uid_map file.
- Reboot the device using any denial-of-service (e.g., TALOS-2020-1129, though memory corruptions work too). This applies the new uid_map and gives us the elevated azured UID.
- TALOS-2020-1133: Use ptrace to attach to the azured process, and inject a shell. This gives azured capabilities, so we can talk to Secmon (e.g. for flashing firmware images) and talk to Pluton.
Security Monitor (Trusted OS) vulnerabilities
Denial of service
The denial-of-service vulnerabilities below prevent the device from booting (bootloop), requiring manual recovery.
- TALOS-2021-1311: Flash a crafted image package with an invalid list_size field in the SECTION_ABIDepends section. The flashing operation will work, but at boot, the image will be parsed and Secmon will crash, boot looping the device.
- TALOS-2021-1341: Flash a legitimate, but old, Trusted Keystore. The keys contained therein can’t be used to verify the current firmware images, preventing Pluton from completing its verification checks at boot. Note: this vulnerability is theoretical and was discovered in development mode but has not been confirmed in production by either Talos or Microsoft; see the advisory for more details.
Information disclosure
- TALOS-2021-1309: Use the SMSyscallPeripheralAcquire syscall to read an uninitialized field in the peripheral structure, leading to the disclosure of Secmon memory (not arbitrary).
- TALOS-2021-1310: Use the SMSyscallWriteBlockToStageImage syscall to write arbitrary Secmon memory to flash. Then use the SMSyscallReadFlash syscall to read the flash contents, leading to arbitrary information disclosure.
- TALOS-2021-1343: Use the SMSyscallStageBaseManifests to stage a manifest at an arbitrary offset in Secmon’s memory by exploiting a missing bounds check for a read operation.
The implications of this last vulnerability (TALOS-2021-1343) aren’t immediately clear. We think it can be turned into an information leak, but we haven’t fully implemented the attack since it’s rather time-consuming. In theory, it would work as follows:
- Use the SMSyscallStageBaseManifests to stage a manifest at an arbitrary offset in Secmon’s memory by exploiting a missing bounds check for a read operation. Staging a manifest anywhere doesn’t directly give us any info leak, plus, the manifest header is subject to many checks before the actual manifest contents are read. After those checks, however, some data is read again from the manifest buffer, leaving room for a TOCTTOU.
- To bypass the above checks, ideally, one would call the SMSyscallStageBaseManifests syscall and in parallel change the manifest header to a corrupted one. This is not possible from the Normal World, since from Linux we’re issuing an SMC instruction to call Secmon syscalls, and SMC is a synchronous exception.
- Recall that we have real-time cores (M4) at our disposal. Since they are separate cores from the A7, they can run in parallel with any SMC.
- So, how to stage a manifest with corrupted headers?
- Allocate a DMA buffer and instruct a real-time core to write to it continuously.
- Via SMSyscallStageBaseManifests, stage a manifest in the allocated DMA memory (recall that DMA is shared with Linux, Secmon, Pluton, and real-time cores).
- By telling the real-time core to write a corrupted manifest, at some point we’ll hit the TOCTTOU and manage to stage a manifest with corrupted headers.
- After such a manifest is staged, we could leak (non-arbitrarily and with some constraints) Secmon memory via the SMSyscallGetMissingBaseImagesToDownload syscall.
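The TOCTTOU pattern in the theory above (check the header, then read the buffer again) can be illustrated with a deterministic toy model. Nothing here reflects Secmon's real data structures: `SharedBuffer`, `MAX_ENTRIES`, and the one-byte "header" are hypothetical, and the concurrent write from the real-time core is simulated by swapping the buffer contents after a fixed number of reads, whereas a real race would only succeed some of the time.

```python
class SharedBuffer:
    """Memory visible to both the checker and a concurrent writer.

    `flip_after_reads` models the other core's write landing after that
    many reads have completed.
    """

    def __init__(self, initial: bytes, replacement: bytes, flip_after_reads: int):
        self._data = initial
        self._replacement = replacement
        self._flip_after = flip_after_reads
        self._reads = 0

    def read(self) -> bytes:
        if self._reads == self._flip_after:
            self._data = self._replacement  # the concurrent write lands here
        self._reads += 1
        return self._data


MAX_ENTRIES = 4  # hypothetical limit enforced by the header checks


def header_ok(header: bytes) -> bool:
    return header[0] <= MAX_ENTRIES


def stage_double_fetch(buf: SharedBuffer) -> int:
    """Vulnerable pattern: validate one read, then use a second read."""
    if not header_ok(buf.read()):  # time of check
        raise ValueError("bad header")
    return buf.read()[0]           # time of use: fetched again


def stage_single_fetch(buf: SharedBuffer) -> int:
    """Safer pattern: copy once into private memory, then check and use it."""
    snapshot = buf.read()
    if not header_ok(snapshot):
        raise ValueError("bad header")
    return snapshot[0]


valid, corrupted = bytes([2]), bytes([200])
print(stage_double_fetch(SharedBuffer(valid, corrupted, 1)))  # prints 200
print(stage_single_fetch(SharedBuffer(valid, corrupted, 1)))  # prints 2
```

The double-fetch variant ends up using a value that never passed the check, which is the situation the continuous writes from the real-time core are meant to provoke.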
Firmware downgrade
- TALOS-2021-1342: Use an image-type field larger than 0x19 in a crafted manifest image to bypass signature checks while staging a base manifest with the SMSyscallStageBaseManifests syscall. This allows us to create an arbitrary base manifest, which in turn lets us flash any (Microsoft-signed) image, for any version.
- TALOS-2021-1344: Stage an arbitrary (old) recovery manifest and use SMSyscallCommitImageStaging to flash the 1BL referenced in the recovery manifest. This allows us to downgrade the 1BL to any version. Note: this vulnerability is theoretical and was discovered in development mode but has not been confirmed in production by either Talos or Microsoft; see the advisory for more details.
It’s important to note that after discovering these downgrade issues, we haven’t looked for vulnerabilities in older versions. A motivated attacker, however, would have certainly done so. For example, they could find a vulnerability in image parsing in an older 1BL or Pluton version, allowing an attacker to maintain persistence on the device, even bypassing remote attestation.
Pluton vulnerabilities
Denial of service
The denial-of-service vulnerabilities below are not persistent but are triggerable from the Normal World userland. Normally, for an app to have the ability to reboot the device, it should declare a PowerControls section in its manifest. These vulnerabilities allow rebooting the device without such permission.
- TALOS-2020-1117: From an unprivileged app on the Linux side, send asynchronous ioctls repeatedly to fill Pluton’s ring buffer and trigger a watchdog that would reboot the device.
- TALOS-2021-1347: From an unprivileged app on the Linux side, send ioctls repeatedly to hit a rate limit in Pluton, which is handled with a device reboot. Microsoft has not fixed this, since they consider it intended behavior.
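The ring-buffer case (TALOS-2020-1117) boils down to a producer outpacing a consumer on a bounded queue. The sketch below is a toy model of that dynamic. The capacity, the drain rate, and the rule that a full queue trips the watchdog are all assumptions chosen for illustration, not measured properties of Pluton.

```python
from collections import deque
from typing import Optional


class MailboxModel:
    """Bounded request queue with a fixed number of requests drained per tick."""

    def __init__(self, capacity: int, drained_per_tick: int) -> None:
        self.queue = deque()
        self.capacity = capacity
        self.drained_per_tick = drained_per_tick
        self.rebooted = False

    def tick(self, incoming: int) -> None:
        """Advance one time step with `incoming` new asynchronous requests."""
        for _ in range(min(self.drained_per_tick, len(self.queue))):
            self.queue.popleft()  # the consumer handles a request
        for request in range(incoming):
            if len(self.queue) >= self.capacity:
                self.rebooted = True  # modeled watchdog: no room left
                return
            self.queue.append(request)


def ticks_until_reboot(model: MailboxModel, incoming: int,
                       max_ticks: int = 100) -> Optional[int]:
    """Return the tick at which the modeled watchdog fires, or None."""
    for tick in range(1, max_ticks + 1):
        model.tick(incoming)
        if model.rebooted:
            return tick
    return None


# A flood of 3 requests per tick against a drain rate of 1 fills 8 slots fast.
print(ticks_until_reboot(MailboxModel(capacity=8, drained_per_tick=1), incoming=3))
# A well-behaved sender that matches the drain rate never fills the queue.
print(ticks_until_reboot(MailboxModel(capacity=8, drained_per_tick=1), incoming=1))
```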
This is the only memory corruption we found in Pluton; according to Microsoft, it is not exploitable.
- TALOS-2020-1139: Out-of-bounds write with a buffer with arbitrary contents, via the SIGN_WITH_TENANT_ATTESTATION_KEY Pluton syscall.
As part of our research into Azure Sphere, we disclosed and helped fix 31 vulnerabilities that we’ve found in the device, starting from code execution in a customer application, up to firmware downgrade and Secmon information disclosures. There are many ways to chain the above vulnerabilities to reach this goal. Just pick, in sequence:
- Any issue from the “Unsigned code execution” section.
- Any “Information disclosure” and “Memory corruption” from the “Normal World” section. Alternatively, the “Privilege escalation chain” subsection for the “Normal World.”
- A “Firmware downgrade” or an “Information disclosure” issue from the “Security Monitor” section.
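The "pick in sequence" recipe above can be written as a naive enumeration over the advisory IDs listed in this post. This is a counting exercise only: the grouping into lists is our reading of the sections above, the privilege escalation chain is treated as a single step, and not every resulting combination has been validated as a working exploit chain (for example, TALOS-2020-1089 leaks userland rather than kernel addresses).

```python
from itertools import product

UNSIGNED_CODE_EXEC = [
    "TALOS-2020-1093", "TALOS-2020-1138", "TALOS-2020-1090", "TALOS-2021-1249",
    "TALOS-2021-1247", "TALOS-2020-1128", "TALOS-2020-1134",
]
NORMAL_WORLD_INFO_LEAK = [
    "TALOS-2020-1089", "TALOS-2020-1130", "TALOS-2021-1339",
    "TALOS-2021-1340", "TALOS-2020-1211", "TALOS-2021-1243",
]
KERNEL_MEMORY_CORRUPTION = ["TALOS-2020-1118", "TALOS-2021-1250", "TALOS-2021-1262"]
# The privilege escalation chain is used as one unit, in the listed order.
PRIVILEGE_ESCALATION_CHAIN = (
    "TALOS-2020-1131", "TALOS-2020-1132", "TALOS-2020-1137", "TALOS-2020-1133",
)
SECMON_STAGE = [
    "TALOS-2021-1342", "TALOS-2021-1344",                      # firmware downgrade
    "TALOS-2021-1309", "TALOS-2021-1310", "TALOS-2021-1343",   # information disclosure
]


def candidate_chains():
    """Yield every chain of the form: code exec, kernel step, Secmon step."""
    kernel_steps = [
        (leak, corruption)
        for leak, corruption in product(NORMAL_WORLD_INFO_LEAK, KERNEL_MEMORY_CORRUPTION)
    ]
    kernel_steps.append(PRIVILEGE_ESCALATION_CHAIN)
    for first, middle, last in product(UNSIGNED_CODE_EXEC, kernel_steps, SECMON_STAGE):
        yield (first, *middle, last)


chains = list(candidate_chains())
print(len(chains))
print(" -> ".join(chains[0]))
```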
As discussed in the “Firmware downgrade” section, this is one of the most critical exploitation paths, since it would lead to a full compromise of the device, which, in the worst-case scenario, might allow an attacker to maintain persistence in the device, even bypassing remote attestation.
However, even without fully compromising Pluton or Security Monitor, an attacker could still carry out a range of malicious acts, including:
- Arbitrary userland/kernel code can be run, so the Linux side can be controlled by an attacker over the network. For example, it can be made part of a botnet.
- All company/customer applications on the box would obviously be fully compromised as well, together with their data.
- Remote attestation would likely not be able to verify that the Linux Normal World has been tampered with, because this exploit chain does not require any changes in the firmware of the device.
- Updates could easily be disabled (they’re handled by a userland process), although this could in theory be detected when a new version gets released, as the automatic updates would fail.
- The device could be bricked (until manual recovery over USB) in several different ways (TALOS-2020-1129, TALOS-2021-1311, TALOS-2021-1341).
- The ability to turn off the device via software could be disabled since we can simply discard such requests.
It is also worth noting that, in theory, an exploit doesn't necessarily need to flash new firmware in order to maintain persistence or act as part of a botnet. Access to the Azure cloud would therefore still be available, since device attestation depends on image signatures (which don't necessarily have to change).
In other words, it might be easier to list what we cannot do: