1. Meltdown patch: KPTI (x86 and AArch64)
The root cause of Meltdown is speculative execution in the CPU, so it is essentially a hardware vulnerability. How can it be fixed in software?
1. The starting point of the Meltdown fix
On x86, each process has one page table that covers both kernel-space and user-space addresses. Even while the CPU executes in user mode, that page table still contains the kernel-space mappings, so an illegal user-space access to a kernel address can be executed speculatively before the page fault is delivered. The simplest and most effective software fix is therefore to make sure that, while the CPU runs in user mode, the active page table does not cover kernel-space addresses.
To fix the vulnerability, a second page table is added: one page table for user mode and one for kernel mode, with a page-table switch on every transition between the two. However, the user-mode page table cannot drop the kernel mappings completely, because interrupts, system calls, and exceptions must enter and leave the kernel, and the entry and exit code lives at kernel-space addresses. The user-mode page table therefore covers user space plus a small kernel trampoline. When a system call occurs, the trampoline code immediately switches to the complete page table that covers user space and the whole kernel; on the way out, it switches back to the page table that covers only user space plus the trampoline. The accompanying figure shows this layout.
As a result, once the Meltdown fix is applied, an application that makes many system calls or is interrupted frequently pays for a page-table switch on every transition between user mode and kernel mode. The overhead is significant and performance drops.
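The two-page-table scheme can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: the address layout, the `struct cpu` type, and the function names are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address-space layout for the sketch (not real x86 values). */
#define KERNEL_BASE      0x8000u  /* addresses >= KERNEL_BASE are "kernel" */
#define TRAMPOLINE_START 0x8000u  /* entry/exit stubs live in this window */
#define TRAMPOLINE_END   0x8100u

enum pgd { PGD_USER, PGD_KERNEL };

struct cpu {
    enum pgd active;         /* which page table is currently loaded */
    unsigned long switches;  /* number of page-table switches so far */
};

/* Does the currently loaded page table map this address? */
static bool is_mapped(const struct cpu *c, uint32_t addr)
{
    if (addr < KERNEL_BASE)
        return true;                 /* user space is mapped in both tables */
    if (c->active == PGD_KERNEL)
        return true;                 /* full table: whole kernel is mapped */
    return addr >= TRAMPOLINE_START && addr < TRAMPOLINE_END;
}

static void kernel_entry(struct cpu *c) { c->active = PGD_KERNEL; c->switches++; }
static void kernel_exit(struct cpu *c)  { c->active = PGD_USER;   c->switches++; }

/* One system call costs two page-table switches in this model. */
static void do_syscall(struct cpu *c)
{
    assert(c->active == PGD_USER);   /* system calls are issued from user mode */
    kernel_entry(c);
    /* ... kernel work would happen here ... */
    kernel_exit(c);
}
```

In this model, a kernel address outside the trampoline window is simply not translatable while `PGD_USER` is active, and the `switches` counter makes the per-syscall cost visible.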
AArch64 has two page tables to begin with. The hardware provides two registers, TTBR0 and TTBR1: the page table installed in TTBR0 covers only user space, and the page table installed in TTBR1 covers only kernel space. AArch64 is still exposed to Meltdown, however, because TTBR1 continues to cover the entire kernel space while the CPU runs in user mode, so kernel addresses can still be accessed speculatively.
The fix on AArch64 takes an x86-like approach: while running in user mode, the page table installed in TTBR1 no longer covers the entire kernel address space. A third page table (the trampoline) is introduced. TTBR1 also points to it, but it maps only the kernel code that handles entry and exit for interrupts, system calls, and exceptions. What TTBR1 points to is changed on every switch between user mode and kernel mode. As the accompanying figure shows, when running in user mode (EL0), no page table covers the entire kernel address space.
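The TTBR1 swap can be modelled in the same spirit. The sketch below is a simplified illustration, not kernel code: the enum, the trampoline window, and the helper names are hypothetical, and every lower-half (TTBR0) address is treated as mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tables; a real TTBR holds the physical address of a page table. */
enum ttbr1_table { TTBR1_TRAMPOLINE, TTBR1_FULL_KERNEL };

struct cpu_state {
    enum ttbr1_table ttbr1;  /* what TTBR1 currently points at */
};

/* Assumed split for the sketch: top bit set means the upper half (TTBR1). */
static bool uses_ttbr1(uint64_t va) { return (va >> 63) != 0; }

/* Hypothetical trampoline window inside the kernel half. */
#define TRAMP_VA_START 0xFFFF000000000000ull
#define TRAMP_VA_END   0xFFFF000000001000ull

static bool translates(const struct cpu_state *s, uint64_t va)
{
    if (!uses_ttbr1(va))
        return true;                     /* TTBR0 side: user mappings */
    if (s->ttbr1 == TTBR1_FULL_KERNEL)
        return true;                     /* full kernel table installed */
    return va >= TRAMP_VA_START && va < TRAMP_VA_END;
}

/* Entry and exit only change what TTBR1 points at. */
static void el0_to_el1(struct cpu_state *s) { s->ttbr1 = TTBR1_FULL_KERNEL; }
static void el1_to_el0(struct cpu_state *s) { s->ttbr1 = TTBR1_TRAMPOLINE; }
```

The difference from the x86 model is that the user half is untouched: only the table behind TTBR1 is exchanged on each transition.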
2. Security issues at the boundary between the kernel and user space
i. Why check the address range? access_ok
An application has countless ways to forge a pointer that actually refers to kernel space, so the kernel faces a series of security problems whenever it accesses user data.
When kernel code accesses user space, it must first verify that the user-space address is legal.
2. Why access_ok is needed
For example, suppose the kernel stores a kernel-space value a at a user-supplied address b (*b = a;). Without an access_ok check, nothing guarantees that b really lies in user space, so an attacker can pass a kernel-space address k disguised as a user-space address and have the kernel overwrite the contents of that kernel address (*k = a). From there the Linux kernel can be compromised in countless ways, for example by modifying code logic to gain root privileges.
Many vulnerabilities originate at the boundary between the kernel and user space. At that boundary, you must therefore validate addresses and use copy_from_user/copy_to_user.
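The kind of range check that access_ok performs can be sketched as follows. This is a user-space illustration under assumptions: `USER_LIMIT`, `access_ok_sketch`, and `put_value_sketch` are hypothetical names and values, and the real kernel macro differs by architecture and version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of user space for this sketch. */
#define USER_LIMIT 0x0000800000000000ull

/* True only if [addr, addr + size) lies entirely below USER_LIMIT.
 * Written so that addr + size is never computed and cannot overflow. */
static bool access_ok_sketch(uint64_t addr, uint64_t size)
{
    return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}

/* Simulated store to a user pointer: returns 0 on success,
 * -1 (standing in for -EFAULT) if the range is not in user space. */
static int put_value_sketch(uint64_t dst, uint64_t size)
{
    if (!access_ok_sketch(dst, size))
        return -1;
    /* the real store to user memory would happen here */
    return 0;
}
```

A kernel address passed in as `dst` is rejected before any store happens, which is exactly the `*k = a` case described above.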
Further reading on security vulnerabilities:
CVE Details
Https://Linux-Linux-Kernel
CVE-2017-5123 Vulnerability
https://github.com/nongiach/CVE/tree/master/CVE-2017-5123
ii. copy_from/to_user APIs
3. Why copy between user and kernel space
The kernel must not operate directly on user-space data. Instead, it calls copy_from_user and works on the resulting kernel-space copy. If the kernel operated on user-space data in place and the user program were multi-threaded, another thread could modify the data while the kernel is using it and thereby attack the kernel.
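The "copy once, then validate and use the copy" pattern might look like the sketch below. It runs in user space, so `copy_in_sketch` is just a `memcpy` standing in for copy_from_user; `struct request`, `BUF_MAX`, and `handle_request` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_MAX 16

struct request {
    size_t len;      /* number of payload bytes the caller claims */
    char   data[64];
};

/* Stand-in for copy_from_user(): a plain memcpy in this user-space sketch. */
static void copy_in_sketch(void *dst, const void *user_src, size_t n)
{
    memcpy(dst, user_src, n);
}

/* Returns the number of bytes consumed, or -1 if the request is rejected. */
static int handle_request(const struct request *user_req, char *out)
{
    struct request snap;

    copy_in_sketch(&snap, user_req, sizeof(snap)); /* fetch exactly once */
    if (snap.len > BUF_MAX)
        return -1;                                 /* validate the private copy */
    memcpy(out, snap.data, snap.len);              /* use the same copy */
    return (int)snap.len;
}
```

Because the length that is checked and the length that is used both come from `snap`, a second thread that rewrites `user_req->len` after the check cannot influence the copy into `out`.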
iii. Preventing the kernel from accessing user space: PAN and SMAP
The safest approach would be to make the kernel completely unable to access user space, but that is not feasible. The next best approach is a hardware switch that is opened each time the kernel needs to access user space and closed again afterwards, which confines kernel access to user space to the copy_from/to_user APIs. PAN on ARM and SMAP on x86 are the mechanisms that implement this.
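The open/close window can be modelled with a flag. This is a toy simulation, not how PAN or SMAP are programmed: `user_access_allowed`, `touch_user`, and `copy_from_user_sketch` are hypothetical names, and the "fault" is only a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Models the hardware switch: false means a kernel access to user memory faults. */
static bool user_access_allowed = false;
static unsigned long blocked_accesses = 0;

/* Every kernel touch of user memory goes through this check in the sketch. */
static bool touch_user(void *dst, const void *user_src, size_t n)
{
    if (!user_access_allowed) {
        blocked_accesses++;       /* real hardware would raise a fault here */
        return false;
    }
    memcpy(dst, user_src, n);
    return true;
}

/* The only sanctioned path: open the window, copy, close the window. */
static bool copy_from_user_sketch(void *dst, const void *user_src, size_t n)
{
    bool ok;

    user_access_allowed = true;
    ok = touch_user(dst, user_src, n);
    user_access_allowed = false;
    return ok;
}
```

A stray dereference of a user pointer outside `copy_from_user_sketch` is caught by the closed switch, which is the property the hardware feature is meant to provide.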
3. Avoid memory fragmentation
What is fragmentation?
Internal fragmentation: a request for 32 bytes would cost a whole page if served directly by the buddy allocator; this is the problem the slab allocator solves.
External fragmentation: a request for 2^n contiguous pages cannot be satisfied because the free pages are not contiguous, even though plenty of memory is free.
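Both kinds of fragmentation can be put into numbers with a small sketch. It assumes a 4096-byte page and a plain boolean free map; `wasted_bytes` and `can_alloc_order` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u  /* assumed page size */

/* Internal fragmentation: bytes wasted when a request is rounded up to whole pages. */
static size_t wasted_bytes(size_t request)
{
    size_t pages = (request + PAGE_SIZE_SKETCH - 1) / PAGE_SIZE_SKETCH;
    return pages * PAGE_SIZE_SKETCH - request;
}

/* External fragmentation: is there an aligned run of 2^order free pages? */
static bool can_alloc_order(const bool *free_map, size_t npages, unsigned order)
{
    size_t run = (size_t)1 << order;

    for (size_t start = 0; start + run <= npages; start += run) {
        size_t i = 0;
        while (i < run && free_map[start + i])
            i++;
        if (i == run)
            return true;   /* every page in this aligned block is free */
    }
    return false;
}
```

For instance, a map with every second page free has half of its memory available yet cannot satisfy even an order-1 (two-page) request, while the same number of free pages grouped together can satisfy an order-2 request.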
Benefits of contiguous memory: first, it can serve requests that need physically contiguous memory (CMA, for example, itself requires contiguous memory); second, it lets the MMU use huge pages, which improves the TLB hit rate and reduces the cost of walking the page table on memory accesses.
The early buddy algorithm did not distinguish between types of memory requests. In the figure, the red pages are allocated non-movable pages, for example memory the kernel requested with kmalloc. Many of the surrounding pages are free, but because the red pages are not released, the free white pages around them cannot be merged into larger blocks. Later versions of Linux therefore separate non-movable pages from movable ones and divide free memory into movable (usually user-program memory), reclaimable (usually file-system cache), and non-movable (usually kernel memory). This keeps non-movable and movable pages from being mixed together, so movable pages can be merged to provide more contiguous memory.
4. The significance of migrate types
As the figure shows, the free list of each zone is split into UNMOVABLE, MOVABLE, and RECLAIMABLE lists that are managed separately. Note that the amount of memory of each type changes dynamically. For example, when Linux has just booted, memory is movable by default. If a kernel module then calls kmalloc to request one page of non-movable memory, the kernel does not carve a single page out of the movable memory. Instead, it converts a larger block (for example, 2^10 pages) to non-movable use in one go, so that the movable memory is not polluted further by non-movable pages. If those 2^10 pages are not enough, another block is converted. Non-movable memory is released when it is no longer used, and once released it can still serve as a fallback.
The red line in the figure shows the fallback order: if a request for non-movable memory cannot be satisfied from the non-movable list, the kernel first looks in the reclaimable list and, if nothing is found there, in the movable list.
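The fallback search can be sketched with a small table. The unmovable row follows the order described above; the other two rows, the `MT_*` names, and `pick_list` are assumptions made for this illustration and should be checked against the kernel version in use.

```c
#include <assert.h>
#include <stddef.h>

enum migrate_type { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_COUNT };

/* Fallback order per type; MT_COUNT terminates each row.
 * Only the MT_UNMOVABLE row comes from the text; the others are assumed. */
static const enum migrate_type fallbacks[MT_COUNT][3] = {
    [MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE,   MT_COUNT },
    [MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE, MT_COUNT },
    [MT_RECLAIMABLE] = { MT_UNMOVABLE,   MT_MOVABLE,   MT_COUNT },
};

/* free_blocks[t] is the number of free blocks on list t.
 * Returns the list to allocate from, or -1 if every list is empty. */
static int pick_list(const unsigned *free_blocks, enum migrate_type want)
{
    if (free_blocks[want] > 0)
        return (int)want;                 /* preferred list has memory */
    for (size_t i = 0; fallbacks[want][i] != MT_COUNT; i++) {
        enum migrate_type t = fallbacks[want][i];
        if (free_blocks[t] > 0)
            return (int)t;                /* first non-empty fallback wins */
    }
    return -1;
}
```

With an empty unmovable list, a request for unmovable memory lands on the reclaimable list first and reaches the movable list only when the reclaimable list is empty too.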
CMA is a special case. As the figure shows, when a request for movable memory cannot be satisfied from the movable list, the kernel first tries the CMA area. If CMA cannot satisfy the request either, memory is taken according to the fallback table (the green line in the figure).
(This matches the first lesson: when CMA memory is not in use, it serves requests for movable pages.)
Memory compaction in the Linux kernel is similar to disk defragmentation. When a large contiguous allocation or a huge page is requested, the kernel triggers the memory compaction mechanism. In the figure, the red parts are allocated memory and the white parts are free memory. Compaction moves the red pages at the front into the white free pages at the back, which makes the free memory contiguous.
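The idea of moving allocated pages from the front into free pages at the back can be sketched with two scanners working towards each other. This is a toy model over an array of page states, with hypothetical names (`PG_*`, `compact_sketch`); real compaction also has to copy page contents and update mappings.

```c
#include <assert.h>
#include <stddef.h>

enum page_state { PG_FREE, PG_MOVABLE, PG_UNMOVABLE };

/* Moves movable pages from the low end into free slots at the high end.
 * Returns the number of pages migrated. */
static size_t compact_sketch(enum page_state *pages, size_t n)
{
    size_t lo = 0, hi = n, moved = 0;   /* hi is one past the free scanner */

    for (;;) {
        while (lo < hi && pages[lo] != PG_MOVABLE)
            lo++;                       /* migration scanner: next movable page */
        while (hi > lo && pages[hi - 1] != PG_FREE)
            hi--;                       /* free scanner: next free page */
        if (lo >= hi)
            break;                      /* scanners met: nothing left to move */
        pages[hi - 1] = PG_MOVABLE;     /* "migrate" the page upward */
        pages[lo] = PG_FREE;
        moved++;
    }
    return moved;
}
```

Starting from the layout movable, free, movable, free, free, free, two migrations leave four contiguous free pages at the front. Unmovable pages are never touched, which is why keeping them apart from movable pages matters.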
As the figure shows, after echo 1 > /proc/sys/vm/compact_memory, the number of low-order free blocks decreases while the number of high-order free blocks increases.
Summary
Migrate types are anti-fragmentation: memory is classified at allocation time and movable pages are kept apart from non-movable ones. This is a preventive technique.
Compaction is defragmentation: fragmentation has already occurred, and compaction cleans it up afterwards.