[5.11] Exploring null pointer de-reference in io_uring_create

I run an instance of syzkaller and it recently encountered a bug in the io_uring subsystem. The bug was discovered because of KASAN, and I learnt quite a bit about io_uring's internals while exploring the root cause.

For anyone curious, I was running the fuzzer on the stable tree's v5.11.12 tag, with these kernel config options. Until now, I had relied on the ready-made C reproducers for the reported bugs and shamelessly avoided the hard work of creating my own reproducer; not this time.

I searched the syzbot dashboard for a report of the same issue and could not find an entry for it. This means syzbot has not discovered the bug so far, which is fascinating.

This made the challenge even more interesting, and I decided to explore the bug in more depth, starting with the reproducer that syzkaller generated.

Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:2 Slowdown:1 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 Leak:false NetInjection:true NetDevices:true NetReset:true Cgroups:true BinfmtMisc:true CloseFDs:true KCSAN:false DevlinkPCI:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:true UseTmpDir:true HandleSegv:true Repro:false Trace:false}
r0 = io_uring_setup(0x7994, &(0x7f0000000080)={0x0, 0x0, 0x42})
io_uring_setup(0x7eff, &(0x7f0000000000)={0x0, 0x0, 0x23, 0x0, 0x0, 0x0, r0})
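The third value in each params struct is the flags field, so it helps to see which setup flags 0x42 and 0x23 actually select. Below is a minimal sketch of a decoder. The bit values are local copies of the IORING_SETUP_* constants from the io_uring UAPI header; the table and the decode_setup_flags helper are names I made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the IORING_SETUP_* bit values (names shortened).
 * The table and helper below are illustrative, not kernel code. */
struct flag_name {
    uint32_t bit;
    const char *name;
};

static const struct flag_name setup_flags[] = {
    {1u << 0, "IOPOLL"},
    {1u << 1, "SQPOLL"},
    {1u << 2, "SQ_AFF"},
    {1u << 3, "CQSIZE"},
    {1u << 4, "CLAMP"},
    {1u << 5, "ATTACH_WQ"},
    {1u << 6, "R_DISABLED"},
};

/* Write a '|'-separated list of the flags set in `flags` into `out`.
 * `len` is the size of `out` and must be at least 1. */
void decode_setup_flags(uint32_t flags, char *out, size_t len)
{
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(setup_flags) / sizeof(setup_flags[0]); i++) {
        if (!(flags & setup_flags[i].bit))
            continue;
        if (out[0] != '\0')
            strncat(out, "|", len - strlen(out) - 1);
        strncat(out, setup_flags[i].name, len - strlen(out) - 1);
    }
}
```

Decoding 0x42 this way gives SQPOLL|R_DISABLED, and 0x23 gives IOPOLL|SQPOLL|ATTACH_WQ. The first call is the interesting one: it asks for a submission queue polling thread on a ring that starts out disabled.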

When syzkaller reports only a syz-repro for a bug, it usually means it could not produce a consistent C reproducer. The next step for me was to translate the syz-repro into a basic C program using syz-prog2c and use that as my starting point.

The C translation of that snippet was:

#define _GNU_SOURCE

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
 

static void use_temporary_dir(void)
{
  char tmpdir_template[] = "./syzkaller.XXXXXX";
  char* tmpdir = mkdtemp(tmpdir_template);
  if (!tmpdir)
    exit(1);
  if (chmod(tmpdir, 0777))
    exit(1);
  if (chdir(tmpdir))
    exit(1);
}

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

uint64_t r[1] = {0xffffffffffffffff};

int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  use_temporary_dir();
  intptr_t res = 0;
  *(uint32_t*)0x20000084 = 0;
  *(uint32_t*)0x20000088 = 0x42;
  *(uint32_t*)0x2000008c = 0;
  *(uint32_t*)0x20000090 = 0;
  *(uint32_t*)0x20000098 = -1;
  *(uint32_t*)0x2000009c = 0;
  *(uint32_t*)0x200000a0 = 0;
  *(uint32_t*)0x200000a4 = 0;
  res = syscall(__NR_io_uring_setup, 0x7994, 0x20000080ul);
  if (res != -1)
    r[0] = res;
  *(uint32_t*)0x20000004 = 0;
  *(uint32_t*)0x20000008 = 0x23;
  *(uint32_t*)0x2000000c = 0;
  *(uint32_t*)0x20000010 = 0;
  *(uint32_t*)0x20000018 = r[0];
  *(uint32_t*)0x2000001c = 0;
  *(uint32_t*)0x20000020 = 0;
  *(uint32_t*)0x20000024 = 0;
  syscall(__NR_io_uring_setup, 0x7eff, 0x20000000ul);
  return 0;
}
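The raw stores in the generated program are easier to read once they are mapped back to fields of the params struct that starts at 0x20000080 (or 0x20000000 for the second call). The sketch below mirrors only the leading fields of struct io_uring_params; params_head and field_at are hypothetical names used for illustration, and the ring offset structs that follow in the real struct are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the leading fields of struct io_uring_params.
 * params_head is an illustrative name; the trailing sq_off/cq_off
 * members of the real struct are omitted. */
struct params_head {
    uint32_t sq_entries;
    uint32_t cq_entries;
    uint32_t flags;
    uint32_t sq_thread_cpu;
    uint32_t sq_thread_idle;
    uint32_t features;
    uint32_t wq_fd;
    uint32_t resv[3];
};

/* Name the field that a raw store to `addr` hits, given the struct base. */
const char *field_at(uintptr_t base, uintptr_t addr)
{
    switch (addr - base) {
    case offsetof(struct params_head, sq_entries):     return "sq_entries";
    case offsetof(struct params_head, cq_entries):     return "cq_entries";
    case offsetof(struct params_head, flags):          return "flags";
    case offsetof(struct params_head, sq_thread_cpu):  return "sq_thread_cpu";
    case offsetof(struct params_head, sq_thread_idle): return "sq_thread_idle";
    case offsetof(struct params_head, features):       return "features";
    case offsetof(struct params_head, wq_fd):          return "wq_fd";
    case offsetof(struct params_head, resv):           return "resv[0]";
    case offsetof(struct params_head, resv) + 4:       return "resv[1]";
    case offsetof(struct params_head, resv) + 8:       return "resv[2]";
    default:                                           return "unknown";
    }
}
```

With base 0x20000080, the store of 0x42 to 0x20000088 lands on flags and the store of -1 to 0x20000098 lands on wq_fd, which lines up with the fields in the syz program.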

I spun up a qemu VM on my laptop, running a kernel built with the same config, and ran the reproducer in it. The bug did not trigger, which I half expected. Running the C reproducer in a loop did not trigger it either. I then ran the reproducer directly on my laptop, which was running a kernel built with the same config as the fuzzing system and the qemu VM, and the bug triggered immediately on the first run (and on every run after that).
This was confusing: every system ran the same kernel config, yet the bug triggered consistently only on my laptop. At first, I thought it was because of pre-emption. Acting on that guess, I ran the following commands in the qemu VM:

root@syzkaller:~# gcc hax.c -pthread -o repro 
root@syzkaller:~# ./repro 
root@syzkaller:~# while true ; do ./repro  ; done
^C
root@syzkaller:~# while true ; do ./repro  ; done
^C
root@syzkaller:~# while true ; do ./repro  ; done
^C
root@syzkaller:~# while true ; do ./repro  ; done
^C[   41.270250] ==================================================================
[   41.272062] BUG: KASAN: null-ptr-deref in io_disable_sqo_submit+0xc7/0xe0
... truncated...
#Send a SIGINT while this is executing

Sending a signal this way sometimes triggered the bug on the qemu VM. I wanted to analyze the bug on a qemu instance rather than on my laptop, because dynamic debugging on the laptop itself would have been stretching things too far.

Now that I knew I could use signals to my advantage, I decided to clean up the reproducer and rewrite it.

#define _GNU_SOURCE

#include <endian.h>
#include <stdint.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif


int main(void)
{
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  intptr_t res = 0;
  pid_t parent = getpid();
  *(uint32_t*)0x20000084 = 0;
  *(uint32_t*)0x20000088 = 0x42;
  *(uint32_t*)0x2000008c = 0;
  *(uint32_t*)0x20000090 = 0;
  *(uint32_t*)0x20000098 = -1;
  *(uint32_t*)0x2000009c = 0;
  *(uint32_t*)0x200000a0 = 0;
  *(uint32_t*)0x200000a4 = 0;
  //printf("Preparing syscall 1\n");
  if (fork() == 0) {
    kill(parent,SIGKILL);
    exit(0);
  }
  res = syscall(__NR_io_uring_setup, 0x7994, 0x20000080ul);
  printf("Syscall 1 ended with res %ld", res);
  return 0;
}

After many, many hours, I had an extremely consistent reproducer. Let me take a moment to explain how it works. The parent process prepares a struct with values to be fed into the io_uring_setup syscall. The parent then forks, and the only job of the child process is to send SIGKILL to the parent. The signal should arrive while the parent is in kernel mode executing the syscall.
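The signal pattern itself can be observed safely with the roles swapped, so that the surviving process can inspect the result. This is a minimal sketch that does not touch io_uring at all: a child blocks in pause(), the parent sends SIGKILL and then checks how the child died. kill_blocked_child is a name I made up for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a victim that sits in a blocking syscall, SIGKILL it, and return
 * the signal that terminated it (or -1 if anything went wrong). */
int kill_blocked_child(void)
{
    pid_t victim = fork();
    if (victim < 0)
        return -1;

    if (victim == 0) {
        /* Child: stay blocked inside the kernel until a signal arrives. */
        for (;;)
            pause();
    }

    /* Parent: SIGKILL cannot be caught or ignored by the victim. */
    if (kill(victim, SIGKILL) != 0)
        return -1;

    int status = 0;
    if (waitpid(victim, &status, 0) != victim)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The function returns SIGKILL whether the signal lands before or during pause(). In the reproducer above the timing matters more: the kill has to be pending while the parent is inside io_uring_setup for the failure path to be taken.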

BUG: KASAN: null-ptr-deref in io_sq_offload_start fs/io_uring.c:8254 [inline]
BUG: KASAN: null-ptr-deref in io_disable_sqo_submit fs/io_uring.c:8999 [inline]
BUG: KASAN: null-ptr-deref in io_uring_create+0x1275/0x22f0 fs/io_uring.c:9824
Read of size 8 at addr 0000000000000068 by task syz-executor.0/4350

CPU: 0 PID: 4350 Comm: syz-executor.0 Not tainted 5.11.12 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0xc2/0x111 lib/dump_stack.c:120
 __kasan_report+0x165/0x220 mm/kasan/report.c:400
 kasan_report+0x51/0x70 mm/kasan/report.c:413
 check_memory_region_inline mm/kasan/generic.c:176 [inline]
 __asan_load8+0x94/0xb0 mm/kasan/generic.c:252
 io_sq_offload_start fs/io_uring.c:8254 [inline]
 io_disable_sqo_submit fs/io_uring.c:8999 [inline]
 io_uring_create+0x1275/0x22f0 fs/io_uring.c:9824
 io_uring_setup fs/io_uring.c:9852 [inline]
 __do_sys_io_uring_setup fs/io_uring.c:9858 [inline]
 __se_sys_io_uring_setup fs/io_uring.c:9855 [inline]
 __x64_sys_io_uring_setup+0x185/0x1e0 fs/io_uring.c:9855
 do_syscall_64+0x37/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x469c1d

Now that I had a consistent reproducer, I moved on to exploring the fs/io_uring.c code to follow the flow of execution.

io_uring_setup calls io_uring_create, which in turn calls io_sq_offload_create. This function verifies capabilities and obtains the sqd struct with sqd = io_get_sq_data(p);. At this point, ctx->sq_data is still NULL.
Later in the function, ctx->sq_data = sqd; initialises the sq_data pointer within ctx. Note that at this point io_uring has not yet created a thread to perform operations, so sqd->thread == NULL.

The next few lines of the kernel code look like:

} else {
			sqd->thread = kthread_create(io_sq_thread, sqd,
							"io_uring-sq");
		}
		if (IS_ERR(sqd->thread)) {
			ret = PTR_ERR(sqd->thread);
			sqd->thread = NULL;
			goto err;

kthread_create is responsible for populating sqd->thread. Here it fails because our child process sent SIGKILL to the caller, which interrupts the killable wait inside kthread creation:

	if (unlikely(wait_for_completion_killable(&done))) {
		/*
		 * If I was SIGKILLed before kthreadd (or new kernel thread)
		 * calls complete(), leave the cleanup of this structure to
		 * that thread.
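The IS_ERR / PTR_ERR check in the io_uring excerpt relies on the kernel convention of encoding a negative errno in the top 4095 values of the pointer range. Here is a userspace sketch of that convention and of the error branch. The lowercase helpers, mock_sqd and handle_create_result are stand-ins I wrote for illustration, and -EINTR is an assumed error value for the interrupted creation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's ERR_PTR, PTR_ERR and IS_ERR. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

int is_err(const void *ptr)
{
    /* Error pointers live in the last MAX_ERRNO values of the address space. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for the one member of the sq data we care about. */
struct mock_sqd {
    void *thread;
};

/* Same shape as the error branch in the excerpt: on failure, keep the
 * errno as the return value and reset the thread pointer to NULL. */
int handle_create_result(struct mock_sqd *sqd, void *created)
{
    sqd->thread = created;
    if (is_err(sqd->thread)) {
        int ret = (int)ptr_err(sqd->thread);
        sqd->thread = NULL;
        return ret;
    }
    return 0;
}
```

Passing err_ptr(-EINTR) as the created thread returns -EINTR and leaves thread set to NULL, which is the state the error path continues with.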

Ultimately, sqd->thread is NULL again, and we return to the error path of io_uring_create.
This is where the bug is exposed: the error path calls io_disable_sqo_submit, which makes an unchecked call to io_sq_offload_start. io_sq_offload_start is supposed to wake up the thread that kthread_create would have created. Unfortunately, in this case ctx->sq_data itself is NULL.

static void io_sq_offload_start(struct io_ring_ctx *ctx)
{
	struct io_sq_data *sqd = ctx->sq_data;

	if ((ctx->flags & IORING_SETUP_SQPOLL) && sqd->thread)
		wake_up_process(sqd->thread);
}

As sqd is NULL, reading sqd->thread accesses address 0x68, which is the offset of the thread member within struct io_sq_data. This matches the faulting address in the KASAN report.
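The arithmetic behind that address is just base plus member offset. The sketch below uses a mock struct whose padding is chosen so that thread sits at 0x68, as in the report; the real layout of struct io_sq_data depends on the kernel version and config, so mock_sq_data and its opaque member are purely illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout only: `opaque` stands in for whatever members
 * precede `thread` in the real struct on this particular build. */
struct mock_sq_data {
    unsigned char opaque[0x68];
    void *thread;
};

/* Address that a read of sqd->thread touches for a given struct base. */
uintptr_t thread_field_address(uintptr_t sqd_base)
{
    return sqd_base + offsetof(struct mock_sq_data, thread);
}
```

With a base of 0, the computed address is 0x68, the same value KASAN printed for the 8-byte read.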

I would assume the right way to address this bug is to not call io_sq_offload_start if ctx->sq_data is NULL.
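Before touching the kernel, the intended guard can be modelled in userspace. This is only a sketch of the condition, not kernel code: mock_ctx, mock_sq, the MOCK_ flag names and should_start_offload are all hypothetical, and the flag value is a local copy of the UAPI bit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the R_DISABLED setup flag bit. */
#define MOCK_SETUP_R_DISABLED (1u << 6)

/* Hypothetical stand-ins for the two kernel structs involved. */
struct mock_sq {
    void *thread;
};

struct mock_ctx {
    uint32_t flags;
    struct mock_sq *sq_data;
};

/* True only when the ring was created disabled AND sq_data exists,
 * i.e. when it is safe to go on and look at sq_data->thread. */
bool should_start_offload(const struct mock_ctx *ctx)
{
    return (ctx->flags & MOCK_SETUP_R_DISABLED) && ctx->sq_data != NULL;
}
```

For the failing case (flags 0x42 with sq_data set to NULL) the check is false, so the dereference is skipped. When sq_data is valid the behaviour is the same as before the extra check.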

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8b4213de9e08..00b35079b91a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8995,7 +8995,7 @@ static void io_disable_sqo_submit(struct io_ring_ctx *ctx)
 {
        mutex_lock(&ctx->uring_lock);
        ctx->sqo_dead = 1;
-       if (ctx->flags & IORING_SETUP_R_DISABLED)
+       if (ctx->flags & IORING_SETUP_R_DISABLED && ctx->sq_data)
                io_sq_offload_start(ctx);
        mutex_unlock(&ctx->uring_lock);
 

Verifying the patch:

root@syzkaller:~# gcc hax.c -pthread -o repro 
root@syzkaller:~# ./repro 
Preparing syscall 1
Killed

Thanks for reading.

Picture attribution to scylladb.com
