Quick notes on KERNSEAL - SIT-CyberSecurity

The mysterious unreadable `kernseal.txt` file on PaX’ documentation page has
been sitting there since 2003, described as “sealed kernel storage design &
implementation.” In 2006, it was described as:

> the problem KERNSEAL sets out to solve is kernel self-protection, that is, assuming arbitrary read/write access to kernel memory (by some bug, but for all i care, it could even be a mode 777 /dev/mem as well), the goal is to prevent privilege elevation (vs. privilege abuse which is an even harder problem to solve).

After many years of `KERNSEAL ETA WEN` jokes on `#grsecurity`, it was finally
made available to grsecurity beta customers in August 2023 and to LTS ones in
January 2024. I was eagerly expecting minipli’s blogpost on the topic, but
since none has been published so far, I endeavoured to read an old patch a
friend of mine was kind enough to sling my way, and to publish some high-level
notes while waiting for it, as apparently the diff for
`pax-linux-6.2.13-test6-kernseal-only.patch` is “only” `66 files changed, 1118
insertions(+), 361 deletions(-)`. Odds are that most of my understanding is
completely wrong nonetheless, so take everything written here with a mountain
of salt.

The main idea behind `PAX_KERNSEAL` seems to be the _constification_ of
dynamically allocated objects, a bit like what `PAX_CONSTIFY_PLUGIN` does for
static ones, along with completely hiding some of them. It depends on a couple
of things to enforce its security invariants:

- PaX’ RAP, to prevent out-of-(intended)-order execution of existing code, otherwise an attacker could simply ROP their way around KERNSEAL.
- `PAX_PRIVATE_KSTACKS`, to defend against kthreads manipulating each other’s return addresses after RAP checks.
- `CONFIG_PAX_PER_CPU_PGD`, to prevent other kthreads from accessing temporarily unsealed pages on a given CPU.
- `CONFIG_PAGE_TABLE_ISOLATION`, of course.
- No hibernation or kexec support.

It introduces two new page states via GFP flags, and stores those properties in
the `struct page`:

1. `__GFP_SEALED`/`PG_sealed`: The page is mapped **read-only** in the direct map (the linear mapping of all physical memory).
2. `__GFP_HIDDEN`/`PG_hidden`: The page is mapped **invalid** (completely unmapped) in the direct map, so its contents can’t be read or written through the normal direct-map address.

New corresponding migrate types (`MIGRATE_SEALED` and `MIGRATE_HIDDEN`) are
added to the buddy allocator, ensuring that sealed and hidden pages are grouped
together in dedicated pageblocks. Of course those types are non-mergeable,
preventing the allocator from stealing sealed/hidden blocks for normal
allocations.

When a pageblock is set up for sealed or hidden use, `pax_setup_pageblock()`
walks the PMD entries in the direct map and applies
`pmd_wrprotect()` (for sealed) or `pmd_mkinvalid()` (for hidden), followed by a
TLB flush. After allocation, `post_alloc_hook()` verifies that page flags match
the requested GFP flags (sealed pages must have `PG_sealed`, etc.), and updates
per-node statistics (`NR_SEALED`, `NR_HIDDEN`).
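To keep the moving parts straight, here is a toy userspace model of the pageblock logic described above. It is only a sketch: every `toy_`/`TOY_` name is made up for illustration and none of this is code from the patch. The precedence of hidden over sealed when both flags are set is my own assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the GFP flags; the values are arbitrary. */
#define TOY_GFP_SEALED 0x1u
#define TOY_GFP_HIDDEN 0x2u

enum toy_migratetype { TOY_MIGRATE_NORMAL, TOY_MIGRATE_SEALED, TOY_MIGRATE_HIDDEN };
enum toy_prot { TOY_PROT_RW, TOY_PROT_RO, TOY_PROT_INVALID };

struct toy_pageblock {
	enum toy_migratetype type;
	enum toy_prot direct_map; /* protection of the block in the direct map */
};

/* Pick the migrate type a request has to be served from.
 * Assumption: hidden wins if both flags are set. */
static enum toy_migratetype toy_gfp_to_migratetype(unsigned int gfp)
{
	if (gfp & TOY_GFP_HIDDEN)
		return TOY_MIGRATE_HIDDEN;
	if (gfp & TOY_GFP_SEALED)
		return TOY_MIGRATE_SEALED;
	return TOY_MIGRATE_NORMAL;
}

/* Models the pageblock setup step: retag the block and change how it
 * appears in the direct map (read-only for sealed, invalid for hidden). */
static void toy_setup_pageblock(struct toy_pageblock *pb, enum toy_migratetype type)
{
	pb->type = type;
	switch (type) {
	case TOY_MIGRATE_SEALED:
		pb->direct_map = TOY_PROT_RO;
		break;
	case TOY_MIGRATE_HIDDEN:
		pb->direct_map = TOY_PROT_INVALID;
		break;
	default:
		pb->direct_map = TOY_PROT_RW;
		break;
	}
}

/* No stealing: a block only serves requests of its own type. This also
 * doubles as the post-allocation sanity check on the returned page. */
static bool toy_can_serve(const struct toy_pageblock *pb, unsigned int gfp)
{
	return pb->type == toy_gfp_to_migratetype(gfp);
}
```

In this model, a block set up as `TOY_MIGRATE_SEALED` ends up `TOY_PROT_RO` and refuses a plain (flag-less) request, which is the grouping property the dedicated migrate types are there to provide.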

As hidden pages have no valid direct-map mapping, the kernel needs a way to
temporarily access them, which is done via the
`pax_expose_page`/`pax_hide_page` pair, a bit like what `KERNEXEC`’s
`pax_open_kernel`/`pax_close_kernel` do to keep the kernel code read-only.

A dedicated `KM_USER_SLOT` is reserved for KERNSEAL kmap operations, and every
`kmap`-related call is hooked: if the page is hidden, it goes through
`pax_expose_page()` to create a temporary per-CPU mapping; if sealed, access is
blocked entirely with `VM_BUG_ON_PAGE_ALWAYS`, dumping the page and calling
`BUG()`.
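The dispatch in the hooked `kmap` path can be summed up with a small toy model. Again, this is a sketch of my reading rather than the patch’s code: the `toy_` names are invented, the single per-CPU slot stands in for the reserved kmap slot, and refusing a second exposure while the slot is busy is an assumption of the model.

```c
#include <stddef.h>

enum toy_page_state { TOY_PAGE_NORMAL, TOY_PAGE_SEALED, TOY_PAGE_HIDDEN };
enum toy_kmap_result { TOY_KMAP_DIRECT, TOY_KMAP_EXPOSED, TOY_KMAP_REFUSED };

struct toy_page {
	enum toy_page_state state;
};

/* One reserved mapping slot per CPU (hypothetical model of the kmap slot). */
struct toy_cpu {
	const struct toy_page *slot;
};

/* Decide how a mapping request is served, depending on the page state. */
static enum toy_kmap_result toy_kmap(struct toy_cpu *cpu, const struct toy_page *page)
{
	switch (page->state) {
	case TOY_PAGE_HIDDEN:
		if (cpu->slot != NULL)
			return TOY_KMAP_REFUSED; /* slot busy: assumption of this model */
		cpu->slot = page;            /* temporary, CPU-local exposure */
		return TOY_KMAP_EXPOSED;
	case TOY_PAGE_SEALED:
		return TOY_KMAP_REFUSED;     /* the real hook BUG()s instead */
	default:
		return TOY_KMAP_DIRECT;      /* ordinary page: direct map is fine */
	}
}

/* Tear down the temporary exposure, if this page holds the slot. */
static void toy_kunmap(struct toy_cpu *cpu, const struct toy_page *page)
{
	if (cpu->slot == page)
		cpu->slot = NULL;
}
```

The point of the model is that only hidden pages ever occupy the slot, and only on the CPU that asked for them.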

A new kmalloc cache type (`KMALLOC_SEALED`) is added, to allow the kernel to
allocate sealed data at a finer granularity than page-level. The ability to
temporarily unseal memory (for initialization, for example) is provided by
`pax_open_seal()`/`pax_close_seal()`, which are simply wrappers around
`pax_open_kernel` and `pax_close_kernel`.

The most obvious usage of `KERNSEAL` in the patch I have is on `struct cred`:
the mutable fields (`usage`, `rcu`, `non_rcu`) are split into a separate
`struct cred_rw`, while the `cred` structure itself is marked
`__mutable_const`, with the `rw` portion actually being a pointer to
separately-allocated mutable memory:

```
  */
 struct cred {
-	atomic_long_t	usage;
+	struct cred_rw {
+		/* RCU deletion */
+		union {
+			int non_rcu;		/* Can we skip RCU deletion? */
+			struct rcu_head rcu;	/* RCU deletion hook */
+		};
+#ifdef CONFIG_PAX_KERNSEAL
+		struct cred *cred;
+#endif
+		atomic_long_t usage;
+	}
+#ifdef CONFIG_PAX_KERNSEAL
+	*rw;
+#else
+	_rw;
+#endif
 // […]
+#ifdef CONFIG_PAX_KERNSEAL
+} __randomize_layout __mutable_const;
+#else
 } __randomize_layout;
+#endif
```

and used like this:

```
+#ifdef CONFIG_PAX_KERNSEAL
+#define to_cred_rw(cred) (cred->rw)
+#define to_cred(cred_rw) (cred_rw->cred)
+#else
+#define to_cred_rw(cred) (&cred->_rw)
+#define to_cred(cred_rw) (container_of(cred_rw, struct cred, _rw))
+#endif
+
```

This means the credential’s security-sensitive fields (UIDs, GIDs, capabilities) live on sealed pages and cannot be tampered with, while the reference count and RCU linkage live on normal writable memory.
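Here is a stripped-down userspace sketch of that split, mostly to convince myself that the refcount can be bumped through a `const` credential pointer. The `toy_` types are hypothetical simplifications: a plain `long` stands in for `atomic_long_t`, and there is no RCU and no actual sealing.

```c
#include <stddef.h>

struct toy_cred;

/* Mutable half: would live in ordinary writable memory. */
struct toy_cred_rw {
	struct toy_cred *cred; /* back-pointer, as in the KERNSEAL branch */
	long usage;            /* plain long standing in for atomic_long_t */
};

/* Immutable half: would live on a sealed page. */
struct toy_cred {
	unsigned int uid;
	unsigned int gid;
	struct toy_cred_rw *rw;
};

/* Same shape as the KERNSEAL variants of the macros quoted above. */
#define toy_to_cred_rw(cred) ((cred)->rw)
#define toy_to_cred(cred_rw) ((cred_rw)->cred)

/* Wire both halves together; done once, at allocation time. */
static void toy_cred_link(struct toy_cred *cred, struct toy_cred_rw *rw)
{
	cred->rw = rw;
	rw->cred = cred;
	rw->usage = 1;
}

/* Taking a reference only writes to the mutable half, so it works even
 * though the credential itself is const (or, in the kernel, sealed). */
static const struct toy_cred *toy_get_cred(const struct toy_cred *cred)
{
	toy_to_cred_rw(cred)->usage++;
	return cred;
}
```

The pointer indirection is what makes this work: the sealed object only stores the address of its mutable half, and that address never changes after setup.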

Debugging-wise, when a page fault occurs on a direct-mapped address, the fault handler checks whether the page is sealed or hidden and provides a clear diagnostic:

```
+	if (is_direct_mapped_addr((void *)address)) {
+		struct page *page = virt_to_page((void *)address);
+		if (PageSealed(page) || PageHidden(page)) {
+			pr_alert("BUG: unable to handle page fault for %s page at %pS\n",
+				 PageSealed(page) ? "sealed" : "hidden", (void *)address);
```

Moreover, sealed and hidden page counts are exposed via `/proc/meminfo` and
per-node `meminfo`, plus per-process stats in `/proc//smaps`:

- `Sealed:` total sealed pages, in kB
- `Hidden:` total hidden pages, in kB

New `kpageflags` bits (`KPF_SEALED` = 62, `KPF_HIDDEN` = 63) are also exported.
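Since `/proc/kpageflags` hands out one 64-bit flag word per page frame, checking those bits from a userspace tool boils down to a shift and a mask. A minimal sketch, using the bit numbers quoted above and otherwise made-up `toy_` names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as quoted from the patch; the TOY_ prefix is mine. */
#define TOY_KPF_SEALED 62
#define TOY_KPF_HIDDEN 63

enum toy_kpf_state {
	TOY_KPF_STATE_NORMAL,
	TOY_KPF_STATE_SEALED,
	TOY_KPF_STATE_HIDDEN
};

/* Test a single bit of a 64-bit flag word. */
static bool toy_kpf_test(uint64_t flags, unsigned int bit)
{
	return ((flags >> bit) & 1) != 0;
}

/* Classify a flag word. Checking hidden first is an arbitrary choice. */
static enum toy_kpf_state toy_kpf_classify(uint64_t flags)
{
	if (toy_kpf_test(flags, TOY_KPF_HIDDEN))
		return TOY_KPF_STATE_HIDDEN;
	if (toy_kpf_test(flags, TOY_KPF_SEALED))
		return TOY_KPF_STATE_SEALED;
	return TOY_KPF_STATE_NORMAL;
}
```

Feeding it the words read from the file (one `uint64_t` per PFN) would give a quick census of sealed and hidden pages.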

As for `PAX_PRIVATE_KSTACKS` (in the context of `PAX_KERNSEAL`), it creates a
per-CPU page table where each task gets a dedicated slot with guard pages. Only
the current task’s stack is accessible, via dynamic PTE-level (un)mapping
magic. Underlying physical pages are allocated with `__GFP_HIDDEN` of course.
For stack variables that require DMA/async access, an ad-hoc GCC plugin
identifies them and stores them in a per-task dedicated page.

Even though `PAX_PRIVATE_KSTACKS` and `PAX_KERNSEAL` are conceptually simple
mitigations, they are likely super-tedious to apply to the behemoth that is the
Linux kernel codebase. Tackling data-only attacks is hard, and the only other
people seriously trying to address them are at Apple, with their hardware-based
KTRR/CTRR/GXF/APRR/PPL/SPTM/TXM mitigations. This makes KERNSEAL all the more
remarkable since, like everything produced by the PaX Team, it doesn’t require
special hardware support.

Another interesting property of KERNSEAL is that it can serve as a basis for other things, like ensuring that no guest pages are available at the hypervisor level in KVM, for example. I can’t wait to see what will be built on top of it next.

All in all, unsurprisingly, KERNSEAL is yet another all-around tour de force from the PaX Team, who have been consistently producing stellar software-only mitigations before everyone else for almost 25 years.
