
Intel Demos Sapphire Rapids Hardware Accelerator Blocks In Action At Innovation 2022

With Intel’s annual Innovation event taking place in San Jose this week, the company is looking to regain much of the technological momentum it has slowly lost over the past few years. Intel has been working hard to get new products out the door, but the combination of delayed schedules and the inability to show off products to an audience in person has taken some of the luster off of the company and its products. So at its largest in-person technology event, the company is showing off as much silicon as it can, to convince the press, partners, and customers alike that CEO Pat Gelsinger’s efforts have put the company back on track.

Of all Intel’s struggles over the past two years, none has been a better poster child than the Sapphire Rapids server/workstation CPU. A true next-generation product from Intel that brings everything from PCIe 5.0 and DDR5 to CXL and a host of hardware accelerators, much has already been written about the lateness of the oft-delayed chip. It is years late.

But at long last, Intel can finally see the light at the end of the tunnel on those development efforts. With general availability scheduled for the first quarter of 2023, just over a quarter from now, Intel is finally in a position to show off Sapphire Rapids to a wider audience, or at least the press. Or, to read the situation more realistically, with the launches of Sapphire Rapids and its competition drawing near, Intel needs to start seriously promoting the chip now.

At this year’s show, Intel invited members of the press to see a live demo of its pre-production Sapphire Rapids silicon in action. The purpose of the demo, in short, was to let the press say, “We saw it, and it’s real.”

Sapphire Rapids not only delivers long-awaited updates to the CPU’s processor cores, but also adds/integrates dedicated accelerator blocks for several common CPU-intensive server/workstation workloads. Simply put, the idea is that fixed-function silicon can do the job as fast as or faster than CPU cores while consuming a fraction of the power, all at a fraction of the cost in die size. And as hyperscalers and other server operators are looking for significant improvements in computing density and energy efficiency, domain-specific accelerators like these are an excellent way for Intel to deliver those advantages to its customers. It also doesn’t hurt that rival AMD isn’t expected to have similar accelerator blocks.

Sapphire Rapids Silicon Overview

Before we go any further, let’s take a quick look at the Sapphire Rapids silicon.

For demonstration purposes (and eventual reviewer use), Intel has assembled several dual-socket Sapphire Rapids systems using pre-production silicon. And for photo purposes, the company cracked open one of these systems and pulled out a CPU.

At this point there isn’t much that can be said about the silicon, other than the fact that it works. Since it’s still pre-production, Intel isn’t disclosing clock speeds or model numbers, nor the errata that keep it from being final silicon. What we do know is that these chips feature 60 CPU cores, along with the accelerator blocks that were the subject of today’s demos.

Sapphire Rapids Accelerators: AMX, DLB, DSA, IAA, and QAT

Outside of the AVX-512 units within the Sapphire Rapids CPU cores, the server CPUs ship with four dedicated accelerators within each CPU tile.

These are the Intel Dynamic Load Balancer (DLB), the Intel Data Streaming Accelerator (DSA), the Intel In-Memory Analytics Accelerator (IAA), and Intel QuickAssist Technology (QAT). All of these hang off of the chip’s mesh as dedicated devices, essentially acting as PCIe accelerators that have been integrated into the CPU silicon itself. This means the accelerators don’t consume CPU core resources (memory and I/O are another matter), but it also means the number of available accelerator cores doesn’t scale up with the number of CPU cores.

Of these accelerators, all but QAT are new to Intel. And QAT is only a partial exception: the previous generation of that technology was implemented in the PCH (chipset) used with 3rd Generation Xeon (Ice Lake-SP) processors, whereas with Sapphire Rapids it has been integrated into the CPU silicon itself. So while Intel’s use of domain-specific accelerators is not new, with Sapphire Rapids the company is going all-in on the idea.
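Because these blocks are exposed like integrated devices, their presence can be checked much like any other device. Below is a minimal sketch, assuming a Linux system where the idxd kernel driver manages the DSA/IAA instances and exposes them under /sys/bus/dsa; the path and naming are driver conventions, not something Intel detailed in this demo.

```python
# Minimal sketch: enumerate DSA/IAA accelerator instances on Linux.
# Assumes the idxd kernel driver is loaded; it exposes devices under
# /sys/bus/dsa (the same path the accel-config tool operates on).
# QAT instances instead show up as ordinary PCIe devices, e.g. via lspci.
from pathlib import Path

DSA_BUS = Path("/sys/bus/dsa/devices")

def list_idxd_devices() -> list[str]:
    if not DSA_BUS.exists():
        return []
    return sorted(p.name for p in DSA_BUS.iterdir())

if __name__ == "__main__":
    devices = list_idxd_devices()
    if devices:
        print("idxd-managed accelerator devices:", ", ".join(devices))
    else:
        print("No idxd devices found (driver not loaded or no hardware).")
```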

All of these dedicated accelerator blocks are designed to offload specific sets of high-throughput workloads. DSA, for example, speeds up data copies and simple computations such as CRC32 checksums. QAT, meanwhile, is both a cryptographic acceleration block and a data compression/decompression block. IAA is similar, compressing and decompressing data on the fly so that large databases (read: big data) can be held in memory in compressed form. Finally, DLB, which Intel was not demonstrating today, is a block for accelerating load balancing.
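To make those offload targets concrete, here is a small illustrative sketch (purely a software stand-in, not Intel’s demo code) of the CPU-side versions of the operations these blocks absorb: a CRC32 checksum of the sort DSA handles, and compression/decompression of the sort QAT and IAA handle.

```python
# Software versions of two accelerator-offloaded operations, for illustration.
import zlib

payload = b"example database page " * 1024  # ~22 KiB of compressible data

# DSA-style work: a simple checksum over a buffer.
checksum = zlib.crc32(payload)

# QAT/IAA-style work: compress so data can be held in memory in compressed
# form, then decompress again on access.
compressed = zlib.compress(payload, level=6)
restored = zlib.decompress(compressed)

assert restored == payload
print(f"CRC32: {checksum:#010x}, compressed ratio: {len(compressed) / len(payload):.2%}")
```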

Finally, there are the Advanced Matrix Extensions (AMX), Intel’s previously-announced matrix math execution blocks. Similar to tensor cores and other types of matrix accelerators, these are ultra-dense blocks for efficiently executing matrix math. And unlike the other accelerators, AMX isn’t a dedicated accelerator; rather, it’s part of the CPU cores, with each core getting a block.

AMX is Intel’s play for the deep learning market, using even denser data structures to far exceed the throughput achievable today with AVX-512. Intel has GPUs for going beyond that, but with Sapphire Rapids, the company is looking to serve the segment of customers who need to run AI inference very close to the CPU cores, rather than on a less flexible, dedicated accelerator.
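As an illustration of the style of operation AMX executes in hardware, here is a minimal NumPy sketch of an INT8 tile multiply with INT32 accumulation; the tile dimensions are illustrative choices, not the exact AMX tile geometry.

```python
# INT8 matrix multiply with INT32 accumulation over one small tile,
# the kind of operation AMX performs in hardware.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 16, 64, 16  # illustrative tile-sized problem

a = rng.integers(-128, 127, size=(M, K), dtype=np.int8)
b = rng.integers(-128, 127, size=(K, N), dtype=np.int8)

# Accumulate in int32, as matrix engines do, to avoid overflowing int8.
c = a.astype(np.int32) @ b.astype(np.int32)
print(c.shape, c.dtype)  # (16, 16) int32
```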

The Demos

For today’s press demo, Intel set up a group of systems to showcase a series of real-world demos/benchmarks using the new accelerators. The goal was to demonstrate the performance advantage of accelerated workloads on Sapphire Rapids, both versus the same workloads running unaccelerated (on the CPU cores) on the same Sapphire Rapids hardware (that is, why these styles of workloads should use the accelerators), and versus the same workloads running on rival AMD’s EPYC (Milan) CPUs.

Of course, Intel has already run these numbers internally. So the purpose of these demos, beyond disclosing those performance figures, was to show that the numbers are real and how they were obtained. Intel is putting its best foot forward, to be sure, but it’s doing so on workloads that (at least to this author) seem like reasonable tests, using real silicon and real servers.

QuickAssist Technology Demo

The first was a demo of the QuickAssist Technology (QAT) accelerator, with Intel starting on an NGINX workload measuring OpenSSL cryptographic performance.

Aiming for near-identical performance across configurations, Intel was able to achieve roughly 66K connections per second on its Sapphire Rapids server using the QAT accelerator plus 11 of the 120 CPU cores (2×60) to handle the non-accelerated portions of the demo. This compares to Sapphire Rapids needing 67 cores to reach the same throughput without any QAT acceleration, or the dual-socket EPYC 7763 server likewise requiring 67 cores.
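For a sense of how a connections-per-second figure like this is gathered, here is a rough client-side sketch that opens TLS connections in a loop and counts completed handshakes. The host and port are placeholders, and a real benchmark such as Intel’s would drive many parallel clients rather than a single loop.

```python
# Count completed TLS handshakes per second from a single client.
import socket
import ssl
import time

HOST, PORT = "nginx.example.internal", 443  # hypothetical test server
DURATION = 5.0  # seconds to run

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # lab setting only; never do this in production

def measure_handshakes() -> float:
    done = 0
    start = time.perf_counter()
    while time.perf_counter() - start < DURATION:
        with socket.create_connection((HOST, PORT), timeout=2) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST):
                done += 1  # handshake completed
    return done / DURATION

if __name__ == "__main__":
    print(f"{measure_handshakes():.0f} TLS handshakes/sec (single client)")
```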

A second QAT demo measured compression/decompression performance on the same hardware. As you’d expect for a dedicated accelerator block, this benchmark was a blowout: the QAT hardware accelerator beat out the CPUs, even with Intel’s highly optimized ISA-L library running on the latter. And because the task was almost completely offloaded, it consumed only about 4 CPU cores’ worth of time, versus all 120/128 CPU cores for the software workloads.

In-Memory Analytics Accelerator Demo

The second demo was of the In-Memory Analytics Accelerator. Despite the name, it doesn’t actually accelerate the analysis portion of the task. Rather, it’s a compression/decompression accelerator intended for use with databases, so that they can operate on in-memory data without paying a huge CPU performance cost to keep that data compressed.

Running the demo against ClickHouse DB, Intel showed the Sapphire Rapids system holding a 59% queries-per-second performance advantage over the AMD EPYC system (Intel did not run a software-only Intel configuration here), along with reduced memory bandwidth usage and reduced overall memory consumption.
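As a sketch of how a queries-per-second figure can be measured, the snippet below times repeated queries against ClickHouse’s HTTP interface (which listens on port 8123 by default). The host, table, and query are placeholders; Intel did not disclose the exact demo queries.

```python
# Measure queries/sec against ClickHouse's HTTP interface.
import time
import urllib.request

URL = "http://clickhouse.example.internal:8123/"  # hypothetical server
QUERY = "SELECT count() FROM hits WHERE EventDate >= '2022-01-01'"  # placeholder
RUNS = 100

def run_query() -> bytes:
    req = urllib.request.Request(URL, data=QUERY.encode())  # POST the query body
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

if __name__ == "__main__":
    start = time.perf_counter()
    for _ in range(RUNS):
        run_query()
    elapsed = time.perf_counter() - start
    print(f"{RUNS / elapsed:.1f} queries/sec")
```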

The second IAA demo was run against RocksDB, using the same Intel and AMD systems. Here Intel again showed the IAA-accelerated Sapphire Rapids system well in the lead, with 1.9x the performance and nearly half the latency.

Advanced Matrix Extensions Demo

The final demo station Intel set up was configured to showcase both Advanced Matrix Extensions (AMX) and the Data Streaming Accelerator (DSA).

Starting with AMX, Intel ran an image classification benchmark using TensorFlow and the ResNet-50 neural network. This test compared unaccelerated FP32 operation on the CPUs, AVX-512-accelerated INT8 on Sapphire Rapids, and finally AMX-accelerated INT8 on Sapphire Rapids.

This was another blowout in favor of the accelerator. Thanks to the AMX blocks in its CPU cores, the Sapphire Rapids system delivered just under a 2x performance improvement over AVX-512 VNNI mode at a batch size of 1, and better than 2x at a batch size of 16. And versus the EPYC CPUs, the scenario looks even more favorable for Intel, as the current Milan processors don’t offer AVX-512 VNNI at all. The overall performance gain here isn’t as great as going from pure CPU to AVX-512, but then AVX-512 was already (among other things) partway to being a matrix acceleration block on its own.
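For readers who want to reproduce the shape of this benchmark, here is a minimal timing sketch in TensorFlow at batch sizes 1 and 16. It is FP32 only; Intel’s INT8 AVX-512 and AMX runs relied on quantized models and Intel-optimized TensorFlow builds that aren’t reproduced here.

```python
# Time ResNet-50 inference throughput at batch sizes 1 and 16 (FP32 baseline).
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # random weights suffice for timing

def images_per_second(batch_size: int, iters: int = 10) -> float:
    images = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
    model.predict(images, verbose=0)  # warm-up run
    start = time.perf_counter()
    for _ in range(iters):
        model.predict(images, verbose=0)
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

for bs in (1, 16):
    print(f"batch {bs:>2}: {images_per_second(bs):.1f} images/sec")
```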

Data Streaming Accelerator Demo

Finally, Intel demonstrated the Data Streaming Accelerator (DSA) block, bringing things back to Sapphire Rapids’ dedicated accelerator blocks. For this test, Intel set up a network transfer demo using FIO, with a client reading data from a Sapphire Rapids server. Here, DSA was used to offload the CRC32 calculations performed as part of the TCP transmissions, which quickly add up in terms of CPU requirements at the very high data rates Intel was testing (2×100GbE connections).
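To see why offloading the checksum matters at these data rates, here is a small sketch that measures single-core software CRC32 throughput; the resulting GB/s can be compared against the roughly 25 GB/s that a 2×100GbE link can carry. The buffer size mirrors the demo’s 128K block size; everything else is an illustrative choice.

```python
# Measure how fast one core can CRC32-checksum buffers in software.
import time
import zlib

BUF = bytes(128 * 1024)  # one 128K buffer, matching the demo's block size
ITERS = 20_000

start = time.perf_counter()
crc = 0
for _ in range(ITERS):
    crc = zlib.crc32(BUF, crc)  # chain the running checksum across buffers
elapsed = time.perf_counter() - start

gbytes = len(BUF) * ITERS / elapsed / 1e9
print(f"software CRC32: {gbytes:.1f} GB/s on one core (crc={crc:#010x})")
```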

Using a single CPU core to demonstrate efficiency (and because it doesn’t take many CPU cores to saturate the link), the DSA block allowed Sapphire Rapids to deliver 76% more IOPS on a 128K QD64 sequential read than the same workload running on Intel’s optimized ISA-L library. The gap versus the EPYC system was even greater, with DSA keeping latency well under 2,000 microseconds.

A similar test was also run with smaller 16K random reads at QD256, this time on two CPU cores. DSA’s performance advantage wasn’t as great here (only 22% versus Sapphire Rapids running optimized software), but it still beat the EPYC system while delivering lower latency.

First Thoughts

And there we have it: the first press demo of the dedicated accelerator blocks (and AMX) in Intel’s 4th Generation Xeon (Sapphire Rapids) CPUs. We’ve seen it, it exists, and it’s just the tip of the iceberg of everything Sapphire Rapids is set to bring to customers starting next year.

Given the nature and purpose of domain-specific accelerators, none of this should come as a big surprise to the average technical reader. Such accelerators exist to speed up specialized workloads, especially those that are otherwise CPU- and energy-intensive, and that’s exactly what Intel has done here. And with competition in the server market expected to be fierce on general CPU performance, these accelerator blocks are a way for Intel to add further value to its Xeon processors, as well as to stand out from AMD and other competitors that are focused on offering ever more CPU cores.

Expect to see more details on Sapphire Rapids over the next few months as Intel finally gets closer to shipping its next-generation server CPUs.
