AMD’s EPYC Rome Chips Crash After 1,044 Days of Uptime

AMD’s latest processor revision guide for the EPYC 7002 ‘Rome’ server chips reveals an interesting new bug (erratum) that can cause a core on the chip to hang after 1,044 days (about 2.9 years) of uptime, meaning the server has to be reset before it works properly again. AMD says it will not fix this issue.

AMD’s description of the issue affecting 2nd generation EPYC processors (AMD’s 4th generation Genoa chips are the latest) is brief, but there’s a lot more to unravel.

The issue stems from a core failing to come out of the CC6 sleep state. According to AMD, the timing of the failure can vary with spread spectrum and REFCLK frequency; the latter is the reference clock the chip uses to keep track of time.

Reddit user Acidic migraines has a plausible theory about exactly when the core hangs: “Despite what they say, the problem actually manifests itself at 1042 days and about 12 hours. The TSC runs at 2800 MHz, and 2800 * 10**6 * 1042.5 days is approximately equal to 0x380000000000000. That is too many zeros to be a coincidence.”
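The arithmetic behind that theory is easy to check. The short Python sketch below assumes the figures from the quoted comment (a 2800 MHz TSC and a counter value of 0x380000000000000); neither number comes from AMD, and the helper name `days_until` is made up for this illustration.

```python
# Assumption: both constants come from the Reddit theory quoted above, not from AMD.
TSC_HZ = 2800 * 10**6                 # time stamp counter ticking at 2800 MHz
THRESHOLD_TICKS = 0x380000000000000   # suspected counter value at which a core hangs
SECONDS_PER_DAY = 86_400


def days_until(ticks: int, tsc_hz: int = TSC_HZ) -> float:
    """Days of uptime needed for a counter running at tsc_hz to reach ticks."""
    return ticks / tsc_hz / SECONDS_PER_DAY


# 1042.5 days is exactly 90,072,000 seconds, so this product is exact integer math.
ticks_at_1042_5_days = TSC_HZ * 90_072_000

# How far before the 1042.5-day mark the counter reaches the suspected value.
gap_seconds = (ticks_at_1042_5_days - THRESHOLD_TICKS) / TSC_HZ

print("threshold is 56 * 2**52:", THRESHOLD_TICKS == 56 * 2**52)
print(f"threshold reached after {days_until(THRESHOLD_TICKS):.4f} days")
print(f"that is {gap_seconds:.2f} seconds before the 1042.5-day mark")
```

Under these assumptions the counter reaches the suspected value only a few seconds short of 1042.5 days, which is what makes the theory plausible. Passing a different `tsc_hz` to `days_until` shows how the timing shifts with clock frequency, in line with AMD’s note that REFCLK affects when the failure occurs.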

The fix is easy. Either reboot the server before its uptime reaches 1,044 days, which resets the CPU and restarts the 1,044-day “timer,” or disable the CC6 sleep state.
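For the reboot option, a simple uptime check is enough to flag machines that are getting close. The sketch below is one possible approach on Linux, not anything AMD provides: it reads `/proc/uptime`, whose first field is the uptime in seconds. The 44-day safety margin and the function names are assumptions chosen for illustration.

```python
from pathlib import Path

HANG_THRESHOLD_DAYS = 1044   # figure cited in AMD's revision guide, per the article
SAFETY_MARGIN_DAYS = 44      # hypothetical margin: flags a reboot from day 1,000
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Parse the contents of /proc/uptime; the first field is uptime in seconds."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def reboot_due(days_up: float,
               threshold: float = HANG_THRESHOLD_DAYS,
               margin: float = SAFETY_MARGIN_DAYS) -> bool:
    """True once uptime is within `margin` days of the hang threshold."""
    return days_up >= threshold - margin


if __name__ == "__main__":
    uptime_file = Path("/proc/uptime")
    if uptime_file.exists():  # only present on Linux
        days = uptime_days(uptime_file.read_text())
        status = "schedule a reboot" if reboot_due(days) else "ok"
        print(f"uptime: {days:.1f} days ({status})")
```

Run from a daily cron job or a monitoring agent, a check like this gives plenty of warning. It does nothing about the CC6 option, which has to be changed in the platform’s own settings.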

While this bug is interesting, it is not a critical issue for the majority of users, and chip errata are not uncommon. Modern CPUs are the most complex devices mankind has built, and most come to market with numerous errata (bugs) that were discovered during or after the final shipping revision (stepping) of the chip.

With billions of transistors at work, problems are inevitable. It is not unusual for a chip to have over 1,000 errata and bugs, many of which are fixed by new steppings of the chip or by firmware tweaks before launch. These errata can include all sorts of bugs, from security holes to misbehaving flags and cache tags, and chip makers do their best to eliminate them before the chips hit the market.

However, some errata always remain, even when the chip ships. For example, Intel’s 8th-generation chips still have over 150 errata, and those chips launched in 2017. We don’t know how many errata the Rome chips originally had because AMD has removed resolved errata from the list. However, we do know that 39 errata remain, and considering Intel’s situation, this doesn’t seem too bad.

Some errata are left unremediated simply because they are harmless. Critical errata that could leave attack vectors open are another matter, but some function-related errata are simply never patched. Chipmakers weigh factors such as the severity of an erratum, how easy it is to fix, and whether there are enough errata to make another stepping worthwhile. This is no easy task.

Why didn’t AMD discover it sooner? Well, nearly three years is longer than any qualification cycle. AMD’s EPYC Rome chips launched in August 2019, so perhaps some of AMD’s customers have already run into this issue.

Now, this 1,044-day core hang bug is interesting, but the question is, does it really matter? Surely security updates and maintenance would force a reboot at much shorter intervals.

The most realistic scenario involves the Linux live patching feature, which applies updates without rebooting. That can certainly lead to uptimes long enough to trigger the bug. Also, servers for mission-critical applications often have extended uptime.

And some people just want to join the uptime club and set records. To do that, you’ll need to beat the computer aboard the Voyager 2 spacecraft. Yes, the second craft to enter interstellar space. That computer has been running for 16,735 days (over 45 years) and is still going.

Among terrestrial machines, 6,014 days (16 years) seems to be the server record, but I’ve seen plenty of discussion of other contenders for the crown. (The small /r/uptimeporn/ Reddit community has many examples of extended uptime.)

Either way, EPYC Rome chips won’t set that kind of record. This erratum will not be fixed, so not every core will make it significantly past the 1,044-day threshold. AMD’s note states that no fix is planned. Perhaps the company decided it was too costly to fix this issue in silicon, or the performance overhead of a microcode/firmware fix was too high, or there simply aren’t enough affected customers to make a fix worthwhile.

In any case, disabling the CC6 sleep state on your server should let you sleep at night. Alternatively, you can always reboot every 1,000 days.
