    PTI for amd64. · bd50262f
    Konstantin Belousov authored
    This is the first version of the Kernel Page Table Isolation (KPTI)
    implementation for amd64. It provides a workaround for the 'meltdown'
    vulnerability.  PTI is turned off by default for now; enable it with
    the loader tunable vm.pmap.pti=1.
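
    Loader tunables are normally set from /boot/loader.conf. A minimal
    fragment, assuming the stock loader configuration file is in use,
    would look like this:

```
# /boot/loader.conf
vm.pmap.pti="1"
```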
    
    The pmap page table is split into a kernel-mode table and a user-mode
    table. The kernel-mode table is identical to the non-PTI table, while
    the user-mode table is obtained from the kernel table by leaving
    userspace mappings intact and keeping only the following parts of the
    kernel mapped:
    
        kernel text (but not modules text)
        PCPU
        GDT/IDT/user LDT/task structures
        IST stacks for NMI and doublefault handlers.
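
    The split described above can be illustrated with a deliberately
    coarse toy model. It is a sketch, not the pmap code: all names
    (toy_pmap, toy_pti_expose, toy_build_user_table) are hypothetical,
    and visibility is tracked per top-level slot, whereas the real
    selection (e.g. kernel text but not modules text) is necessarily
    finer-grained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPML4E     512	/* entries in an amd64 top-level page table */
#define USER_SLOTS 256	/* toy assumption: lower half user, upper half kernel */

/* Hypothetical flat model: one value per top-level slot, 0 = unmapped. */
struct toy_pmap {
	uint64_t kernel_table[NPML4E];	/* full view, same as non-PTI */
	uint64_t user_table[NPML4E];	/* restricted view for user mode */
};

/* Hypothetical flag: must this kernel slot stay visible in user mode? */
static bool pti_visible[NPML4E];

static void
toy_pti_expose(int slot)
{
	assert(slot >= USER_SLOTS && slot < NPML4E);
	pti_visible[slot] = true;
}

/* Derive the user-mode table from the kernel-mode table. */
static void
toy_build_user_table(struct toy_pmap *pm)
{
	for (int i = 0; i < NPML4E; i++) {
		if (i < USER_SLOTS || pti_visible[i])
			pm->user_table[i] = pm->kernel_table[i];
		else
			pm->user_table[i] = 0;	/* hidden from user mode */
	}
}
```

    Userspace slots are copied unconditionally; a kernel slot survives
    only if it was explicitly exposed beforehand.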
    
    The kernel switches to the user page table before returning to
    usermode, and restores the full kernel page table on entry. The
    initial kernel-mode stack for the PTI trampoline is allocated in PCPU;
    it is only 16 qwords.  The kernel entry trampoline switches page
    tables, then the hardware trap frame is copied to the normal kstack,
    and execution continues.
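
    The entry sequence can be sketched in userland C as follows. This is
    an illustration only: the real trampoline is assembly, the names
    (toy_cpu, toy_hw_push, toy_entry_trampoline) are hypothetical, and
    the frame is the plain five-qword amd64 interrupt frame without an
    error code. The copy is placed after the table switch on the
    assumption that the kstack is not mapped in the user table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PTI_STACK_QWORDS 16	/* trampoline stack size from the text */
#define KSTACK_QWORDS    512	/* arbitrary toy size */

/* Frame pushed by the CPU on an interrupt, in ascending address order. */
struct hw_frame {
	uint64_t rip, cs, rflags, rsp, ss;
};

#define FRAME_QWORDS (sizeof(struct hw_frame) / sizeof(uint64_t))

_Static_assert(FRAME_QWORDS <= PTI_STACK_QWORDS,
    "hardware frame must fit on the trampoline stack");

struct toy_cpu {
	uint64_t pti_stack[PTI_STACK_QWORDS];	/* lives in PCPU */
	uint64_t kstack[KSTACK_QWORDS];		/* normal kernel stack */
	int on_kernel_table;			/* 0 = user, 1 = kernel */
};

/* Model the CPU pushing the frame at the top of the trampoline stack. */
static size_t
toy_hw_push(struct toy_cpu *cpu, const struct hw_frame *f)
{
	size_t base = PTI_STACK_QWORDS - FRAME_QWORDS;

	memcpy(&cpu->pti_stack[base], f, sizeof(*f));
	return (base);
}

/* Switch tables first, then move the frame to the normal kstack. */
static size_t
toy_entry_trampoline(struct toy_cpu *cpu, size_t pti_base)
{
	size_t kbase = KSTACK_QWORDS - FRAME_QWORDS;

	cpu->on_kernel_table = 1;
	memcpy(&cpu->kstack[kbase], &cpu->pti_stack[pti_base],
	    sizeof(struct hw_frame));
	return (kbase);		/* execution continues on the kstack */
}
```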
    
    IST stacks are kept mapped and no trampoline is needed for
    NMI/doublefault, but of course the page table switch is still
    performed.
    
    On return to usermode, the trampoline is used again: the iret frame is
    copied to the trampoline stack, page tables are switched and iretq is
    executed.  The case of iretq faulting due to an invalid usermode
    context is tricky, since the frame for the fault is appended to the
    trampoline frame.  Besides copying the fault frame and the original
    (corrupted) frame to the kstack, the fault frame must be patched to
    make it look as if the fault occurred on the kstack, see the comment
    in doret_iret detection code in trap().
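
    One way to read the patching step is as a rebase of the stack pointer
    saved in the fault frame. The sketch below models only that idea; it
    is not the code in trap(). The types and function are hypothetical,
    addresses are plain integers, the error code word is ignored, and it
    assumes the fault frame sits directly below the failed iret frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_frame {
	uint64_t rip, cs, rflags, rsp, ss;
};

/* Assumed trampoline stack contents right after iretq faults. */
struct toy_fault_pair {
	struct toy_frame fault;	/* pushed by the CPU for the fault */
	struct toy_frame bad;	/* original (corrupted) iret frame */
};

/*
 * Copy both frames to the kstack and patch the saved %rsp in the fault
 * frame so that it points at the kstack copy of the bad frame instead
 * of the trampoline stack.
 */
static void
toy_move_fault_to_kstack(const struct toy_fault_pair *on_pti,
    uint64_t pti_addr, struct toy_fault_pair *on_kstack, uint64_t kstack_addr)
{
	/* The CPU saved %rsp as it was when iretq faulted. */
	assert(on_pti->fault.rsp ==
	    pti_addr + offsetof(struct toy_fault_pair, bad));

	*on_kstack = *on_pti;
	on_kstack->fault.rsp =
	    kstack_addr + offsetof(struct toy_fault_pair, bad);
}
```

    After the move, nothing in the fault frame refers to the trampoline
    stack any more, so later handling can proceed as for an ordinary
    fault taken on the kstack.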
    
    Currently, the kernel pages which are mapped during trampoline
    operation are identical for all pmaps.  They are registered using
    pmap_pti_add_kva().  Besides the initial registrations done during
    boot, LDT and non-common TSS segments are registered if the user
    requested their use.  In principle, they can be installed into the
    kernel page table per pmap with some work.  Similarly, PCPU can be
    hidden from the userspace mapping using a trampoline PCPU page, but
    again I do not see much benefit, only added complexity.
    
    PDPE pages for the kernel half of the user page tables are
    pre-allocated during boot because the pml4 entries that are copied
    into the top-level paging structure page on new pmap creation must be
    known in advance.  I enforce this to avoid iterating over all existing
    pmaps if a new PDPE page is needed for PTI kernel mappings.
    The iteration is a known problematic operation on i386.
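
    The benefit of pre-allocation can be shown with a small model: if
    every user top-level table points at the same fixed second-level
    pages, a mapping added later is visible in all pmaps without walking
    them. This is a sketch under assumptions: the names are hypothetical,
    static storage stands in for boot-time allocation, and the whole
    upper half is pre-populated for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPML4E          512
#define KERN_FIRST_SLOT 256	/* toy assumption: upper half is kernel */
#define KERN_SLOTS      (NPML4E - KERN_FIRST_SLOT)

struct toy_pdp {
	uint64_t pdpe[512];
};

/* Shared second-level pages, fixed for the lifetime of the system. */
static struct toy_pdp pti_pdp_pages[KERN_SLOTS];

/* User-mode top-level table; pointers stand in for pml4 entries. */
struct toy_user_pml4 {
	struct toy_pdp *slot[NPML4E];
};

/* New pmap creation: the kernel-half entries are already known. */
static void
toy_pinit(struct toy_user_pml4 *t)
{
	for (int i = 0; i < NPML4E; i++)
		t->slot[i] = i >= KERN_FIRST_SLOT ?
		    &pti_pdp_pages[i - KERN_FIRST_SLOT] : NULL;
}

/* Add a PTI kernel mapping: touches only the shared page. */
static void
toy_pti_add_mapping(int slot, int idx, uint64_t val)
{
	assert(slot >= KERN_FIRST_SLOT && slot < NPML4E);
	assert(idx >= 0 && idx < 512);
	pti_pdp_pages[slot - KERN_FIRST_SLOT].pdpe[idx] = val;
}
```

    Two pmaps created with toy_pinit() both observe an entry installed
    afterwards by toy_pti_add_mapping(), with no per-pmap update.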
    
    The need to flush hidden kernel translations on the switch to user
    mode makes global tables (PG_G) meaningless and even harmful, so PG_G
    use is disabled for the PTI case.  Our existing use of PCID is
    incompatible with PTI and is automatically disabled if PTI is
    enabled.  PCID can be forced on, but only for the developer's benefit.
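
    The resulting policy can be written down as two small predicates.
    This is a sketch of the rules as stated above, not the pmap
    bootstrap code; the struct and the pcid_forced field are hypothetical
    stand-ins for however the override is actually exposed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PG_G 0x100ULL	/* global bit (bit 8) of an x86 PTE */

struct toy_mmu_cfg {
	bool pti;		/* PTI enabled */
	bool pcid_supported;	/* CPU offers PCID */
	bool pcid_forced;	/* hypothetical developer override */
};

/* Bits to OR into kernel PTEs: no global pages under PTI. */
static uint64_t
toy_pg_g(const struct toy_mmu_cfg *c)
{
	return (c->pti ? 0 : TOY_PG_G);
}

/* PCID is dropped under PTI unless explicitly forced on. */
static bool
toy_pcid_enabled(const struct toy_mmu_cfg *c)
{
	if (!c->pcid_supported)
		return (false);
	if (c->pti && !c->pcid_forced)
		return (false);
	return (true);
}
```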
    
    MCE is known to be broken: it requires an IST stack to operate
    completely correctly even in the non-PTI case, and absolutely needs a
    dedicated IST stack with PTI, because MCE delivery while the
    trampoline has not yet switched from the PTI stack is fatal.  The fix
    is pending.
    
    Reviewed by:	markj (partially)
    Tested by:	pho (previous version)
    Discussed with:	jeff, jhb
    Sponsored by:	The FreeBSD Foundation
    MFC after:	2 weeks