Linux Kernel 6.1 Introduces Faulty CPU Core Detection Logging Feature
As reported by Phoronix, Linux kernel version 6.1 introduces a new logging system for identifying bad CPUs and their associated cores in a server. The logging system can record exactly which core, CPU, and socket a failure occurred on at a given time.
This is not a fully automated system; it is for logging only, and it does not tax the CPU by actively checking for faults. Rik van Riel, who is responsible for getting the CPU logging system approved for 6.1, says that system administrators can enable the logger to see which cores are bad on systems already known to be faulty. He adds that administrators will want to run common kernel code that is known to trigger the failures.
The logger is not perfect, because a kernel task may be rescheduled to a different CPU or CPU core before the failure is logged, but van Riel says this strategy has been sufficient for finding bad CPUs or cores. CPU failures can often be “weirdly specific”, where a particular program or piece of code crashes only on one particular core.
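Since any single report can name the wrong core after a reschedule, one practical way to use such logs is to tally reports over many crashes and look for a core that dominates. The following is a minimal Python sketch of that idea, not the kernel's own tooling. It assumes each crash line ends with a suffix of the form `likely on CPU N (core N, socket N)`; that format, the sample lines, and the helper names `tally_faults` and `suspects` are assumptions for illustration and should be checked against real kernel output.

```python
import re
from collections import Counter

# Assumed message suffix; verify against real kernel logs before relying on it.
CPU_RE = re.compile(r"likely on CPU (\d+) \(core (\d+), socket (\d+)\)")


def tally_faults(log_lines):
    """Count crash reports per (socket, core, cpu) triple."""
    counts = Counter()
    for line in log_lines:
        match = CPU_RE.search(line)
        if match:
            cpu, core, socket = map(int, match.groups())
            counts[(socket, core, cpu)] += 1
    return counts


def suspects(counts, min_share=0.5):
    """Return triples that account for at least min_share of all reports."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [key for key, n in counts.most_common() if n / total >= min_share]


# Hypothetical log lines, invented for this example.
SAMPLE_LOG = [
    "app[101]: segfault at 0 ip ... likely on CPU 5 (core 5, socket 0)",
    "app[102]: segfault at 0 ip ... likely on CPU 5 (core 5, socket 0)",
    "app[103]: segfault at 0 ip ... likely on CPU 2 (core 2, socket 0)",
    "app[104]: segfault at 0 ip ... likely on CPU 5 (core 5, socket 0)",
    "eth0: link is up",  # unrelated line, ignored by the parser
]

if __name__ == "__main__":
    counts = tally_faults(SAMPLE_LOG)
    for key in suspects(counts):
        socket, core, cpu = key
        print(f"suspect: socket {socket}, core {core}, CPU {cpu} ({counts[key]} reports)")
```

The `min_share` threshold is an arbitrary choice here: a lone report on another core, as in the sample, is treated as rescheduling noise rather than evidence of a second bad core.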
The feature isn’t really designed for consumers; it is primarily intended for system administrators running a fleet of Linux-based servers. For those admins, the new tool can be extremely useful in tracking down mysterious hardware failures on machines that remain perfectly stable under serious CPU stress testers like Prime95 and AIDA64.
Error checkers like this one and Intel’s new In-Field Scan technology continue to grow in popularity in the server industry. As CPU features shrink and manufacturing nodes become ever more bleeding-edge, the chances of errors occurring in the silicon (commonly known as soft errors) increase.
In theory, CPUs should become more susceptible to errors, especially those caused by cosmic rays, as they approach the physical limits of transistor size (e.g., 1 nm or less). As a result, CPU error checking becomes increasingly important as transistor density continues to improve over time.