Otherwise, hardware poisoning will cause a system panic. This simple harness uses debugfs to allow failures to be injected at an arbitrary page. "Action Optional" means that the CPU detected some form of corruption in the background and tells the OS about it using a machine check exception. It is not recommended to use them for planning purposes. I guess what you’re missing is who marks the memory as poisoned. Whether or not the CPU referenced the particular word that triggered the fault, the existing MCA may consider such faults catastrophic at the task level, and so does not bother to precisely track which instruction(s) may have consumed the bogus data. This is an early paper about the first version of mcelog.
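As a rough illustration of the debugfs injection harness mentioned above, the sketch below writes a page frame number to an injection file. It assumes the `hwpoison-inject` module is loaded and exposes `corrupt-pfn` under `/sys/kernel/debug/hwpoison`, and that pages are 4 KiB; the helper names `phys_to_pfn` and `inject_poison` are made up for this example. Running it against the real debugfs directory needs root and will really poison the page, so the directory is a parameter that can point somewhere harmless.

```python
import os

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def phys_to_pfn(phys_addr):
    """Convert a physical address to its page frame number."""
    return phys_addr >> PAGE_SHIFT


def inject_poison(pfn, debugfs_dir="/sys/kernel/debug/hwpoison"):
    """Write a PFN (in decimal) to the injection file and return its path.

    The default directory and the "corrupt-pfn" file name are assumptions
    about the hwpoison-inject debugfs layout, not taken from the text above.
    """
    path = os.path.join(debugfs_dir, "corrupt-pfn")
    with open(path, "w") as f:
        f.write(f"{pfn}\n")
    return path
```

Pointing `debugfs_dir` at a scratch directory is a simple way to dry-run the harness before touching a live system.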
This code snippet on the linked page illustrates some of the “action optional” machine check exceptions.
Since page flags are currently in short supply, this choice was not made without consternation and debate by kernel hackers. System programming guide https: It is in volume 3: Introduction to platform hardware errors on modern x86 machines including detailed flows and recent improvements to the Linux x86 machine check handling, with a focus on memory errors.
At first glance, an obvious solution for the poison handler would focus on the specific process and memory address associated with the data error. Related software: The mce-inject injector tool and the mce-test test suite can be used to test machine check handling.
In any case, this bit allows previously poisoned pages to be ignored by the handler. Ignore, failure, and delay are all similar in that the page was not completely isolated, except for flagging the page as poisoned. A CPU read, or better yet, a data prefetch (either triggered explicitly by an instruction or implicitly by a prefetch engine) may have triggered the memory reference that triggered the MCA.
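The role of the poison bit and the handler outcomes described above can be pictured with a toy model. This is only a sketch, not the kernel's implementation: the `Page` class, the `try_isolate` callback, and the `RECOVERED` outcome are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    IGNORED = "ignored"
    FAILED = "failed"
    DELAYED = "delayed"
    RECOVERED = "recovered"  # assumption: name for the fully isolated case


@dataclass
class Page:
    pfn: int
    poisoned: bool = False  # stands in for the per-page poison flag
    isolated: bool = False


def handle_poison(page, try_isolate):
    """Flag the page as poisoned, then attempt isolation.

    try_isolate is a hypothetical callback returning an Outcome.
    A page that is already flagged is ignored on later reports.
    """
    if page.poisoned:
        return Outcome.IGNORED
    page.poisoned = True
    result = try_isolate(page)
    # Only a full recovery counts as isolation; every other outcome
    # leaves the page flagged but not completely isolated.
    page.isolated = result is Outcome.RECOVERED
    return result
```

In this model a second report for the same page returns `Outcome.IGNORED` without calling `try_isolate` again.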
Additionally, the architecture must support data poisoning. This is an early paper about the first version of mcelog. Machine check handling on Linux (paper, slides) for Linux Kongress. Perhaps this is handled properly, but by just unmapping, aren’t you running the risk that some later memory allocation by that process might get the same virtual address, so that instead of receiving a SIGBUS the process keeps running with corrupted memory?
This allows system software to perform recovery actions on certain classes of uncorrected errors and continue. If I’m not mistaken, inte, the processor family this article was referring to.
This link is broken. However, memory poisoning requires both hardware and kernel support. A longer ECC provides the capability to correct and detect more bits.
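To make the relationship between data width and ECC length concrete, here is a small sketch that counts the check bits needed by a Hamming-style SECDED code (single error correction, double error detection). SECDED is an assumption chosen for illustration; the text above does not say which code a given memory system uses.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SECDED code over data_bits of data.

    Single-error correction needs the smallest r with
    2**r >= data_bits + r + 1; one extra overall parity bit
    adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


for width in (8, 32, 64, 128):
    print(width, secded_check_bits(width))
```

For a 64-bit data word this gives 8 check bits, which matches the familiar 72-bit ECC memory word.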
It depends how long the data and the ECC code are. If the detected memory error never actually corrupts executing software, then ignoring or isolating the error is the most desirable action. So there we have it. First, the offending instruction and process cannot be determined due to delays between the data error and execution of the poison handler.
The MCA can occur on any “word”, where “word” is defined by the width of the ECC code applied at the corresponding level of memory. The handler ignores the following types of pages: But that’s not the case the article describes.
In a more serious vein, I found the article less clear and harder to read than the usual material on the kernel page. Alternatively, memory may be occasionally “scrubbed.”
For these uncorrectable errors, the hardware typically generates a machine check which, in turn, causes a kernel panic. Background scrubbing works by reading memory locations, checking the ECC, and correcting correctable errors proactively before they become uncorrectable. As memory density increases, error rates also rise. Thus, when HWPOISON is coupled with the appropriate fault-tolerant processors, Linux users can enjoy systems that are more tolerant of memory errors in spite of increased memory densities.
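The scrubbing loop described above (read each location, check the ECC, correct what is correctable) can be sketched with a toy Hamming(7,4) code. This is a simplification under stated assumptions: real memory controllers scrub in hardware over much wider words, and this code only corrects single-bit errors and cannot detect double-bit ones. All function names are made up for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # 1-based positions holding data bits
PARITY_POSITIONS = (1, 2, 4)    # 1-based positions holding parity bits


def syndrome(codeword):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            codeword |= 1 << (pos - 1)
    s = syndrome(codeword)
    # Set parity bits so that the overall syndrome becomes zero.
    for pos in PARITY_POSITIONS:
        if s & pos:
            codeword |= 1 << (pos - 1)
    return codeword


def scrub(memory):
    """Walk every word, fix single-bit errors in place, return the count."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            # For a single-bit error the syndrome is the flipped position.
            memory[addr] = word ^ (1 << (s - 1))
            corrected += 1
    return corrected
```

Running `scrub` periodically over a list of encoded words repairs isolated bit flips before a second flip in the same word can make the error uncorrectable.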
For example, a cache line written back to main memory may have a data word error that is marked as poisoned. The following two SRAO errors are architecturally defined.