
Doubled Speeds and Flexible Fabrics

While still technically new, the Compute Express Link (CXL) standard for host-to-device connectivity is quickly taking hold in the server market. It is designed to offer a rich I/O feature set built on top of the existing PCI-Express standard, chief among them cache coherence between devices, and it is set to go from connecting CPUs to accelerators in servers to also connecting DRAM and non-volatile storage over what is still physically a PCIe interface. It is an ambitious yet widely backed roadmap that in just three years has made CXL the de facto advanced device interconnect standard, with rivals Gen-Z, CCIX, and, as of yesterday, OpenCAPI all dropping out of the race.

Having all but won the interconnect wars this week, the CXL consortium and its members still have plenty of work ahead of them. On the product side, the first x86 CPUs with CXL support are only just shipping, depending largely on the state of Intel's Sapphire Rapids chips. And on the features side, device vendors are asking for more bandwidth and more capabilities than the initial 1.x releases of CXL offered. Winning the interconnect wars has made CXL the interconnect king, but in the process, CXL must now be able to address some of the more complex use cases that the competing standards were designed for.

To that end, at the Flash Memory Summit 2022 this week, the CXL Consortium is at the show to announce the next full version of the CXL standard, CXL 3.0. Following the 2.0 standard, which was released at the end of 2020 and introduced features such as memory pooling and CXL switches, CXL 3.0 focuses on major improvements in a couple of critical areas for the interconnect. The first is the physical side, where CXL doubles its per-lane throughput to 64 GT/s. The second is the logical side, where CXL 3.0 greatly expands the standard's logical capabilities, allowing for complex connection topologies and fabrics, as well as more flexible memory sharing and memory access modes within a group of CXL devices.

CXL 3.0: Built on PCI-Express 6.0

Starting with the physical side of CXL, the new version of the standard delivers the long-awaited update to incorporate PCIe 6.0. Since the previous versions of CXL, 1.x and 2.0, were both built on top of PCIe 5.0, this is the first physical-layer update for the interconnect since its introduction in 2019.

PCIe 6.0, itself a major update to the inner workings of the PCI-Express standard, once again doubled the amount of bandwidth available over the bus, to 64 GT/s, which for an x16 link works out to 128 GB/s in each direction. This was accomplished by transitioning PCIe from binary (NRZ) signaling to quad-state (PAM4) signaling and adopting a fixed-packet (FLIT) interface, allowing the speed to double without the drawbacks of operating at still higher frequencies. Since CXL is built on top of PCIe, the standard needed to be updated to account for PCIe's operational changes.
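As a quick sanity check on those headline figures, the short Python sketch below reproduces the raw arithmetic; it ignores FLIT, CRC, and FEC overheads, so it is a back-of-the-envelope upper bound rather than a measured throughput.

# Raw signaling rate of a PCIe 6.0 / CXL 3.0 x16 link (protocol overheads ignored)
rate_gbps_per_lane = 64            # 64 GT/s with PAM4, roughly 64 Gb/s per lane, per direction
lanes = 16                         # a full x16 link
raw_gbps = rate_gbps_per_lane * lanes
print(raw_gbps / 8, "GB/s per direction")   # -> 128.0 GB/s per direction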

The end result is that CXL 3.0 inherits the full bandwidth improvement of PCIe 6.0, along with features such as Forward Error Correction (FEC), doubling CXL's total bandwidth compared to CXL 2.0.

Notably, according to the CXL consortium, all of this was accomplished without increasing latency. This was one of the challenges PCI-SIG faced in designing PCIe 6.0, since the required error correction adds latency to the process, which is why PCI-SIG settled on a low-latency form of FEC. Still, CXL 3.0 goes a step further in trying to trim latency, with the result that 3.0 has the same latency as CXL 1.x/2.0.

Similar to the base PCIe 6.0 update, the CXL consortium has also adjusted the FLIT size. CXL 1.x/2.0 used a relatively small 68-byte packet, but CXL 3.0 grows this to 256 bytes. The significantly larger FLIT size is one of the key communication changes in CXL 3.0, as it gives the standard many more bits in the FLIT header, which are needed to enable the complex topologies and fabrics the 3.0 standard introduces. As an added feature, CXL 3.0 also offers a low-latency "variant" FLIT mode that breaks the CRC up into 128-byte "sub-FLIT granular transfers," which is designed to reduce store-and-forward overhead in the physical layer.

Notably, the 256-byte FLIT size keeps CXL 3.0 consistent with PCIe 6.0, which also uses a 256-byte FLIT. And like its underlying physical layer, CXL supports the use of the larger FLIT not only at the new 64 GT/s transfer rate, but also at 32, 16, and 8 GT/s, making the new protocol features available at slower transfer speeds as well.
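To put the new FLIT size and link rates in perspective, here is a rough, illustrative calculation of how long a single FLIT occupies the wire on an x16 link. It treats GT/s as approximately Gb/s per lane and ignores FEC, CRC handling, and switch latency, so the figures are ballpark serialization times only.

def flit_time_ns(flit_bytes, gts_per_lane, lanes=16):
    # bits divided by the aggregate link rate in Gb/s conveniently yields nanoseconds
    bits = flit_bytes * 8
    link_gbps = gts_per_lane * lanes
    return bits / link_gbps

print(flit_time_ns(68, 32))    # 68-byte CXL 1.x/2.0 FLIT at PCIe 5.0 rates: ~1.06 ns
print(flit_time_ns(256, 64))   # 256-byte CXL 3.0 FLIT at 64 GT/s: 2.0 ns
print(flit_time_ns(256, 8))    # the same FLIT on a slower 8 GT/s link: 16.0 ns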

Finally, CXL 3.0 is fully backward compatible with earlier versions of CXL, so devices and hosts can downgrade as needed to match the rest of the hardware chain, albeit losing the new features and speeds in the process.

CXL 3.0 Features: Enhanced Coherency, Memory Sharing, Multilevel Topology, and Fabrics

In addition to further improving overall I/O bandwidth, the aforementioned protocol changes have also been implemented in service of enabling new features within the standard. CXL 1.x was born as a (relatively) simple host-to-device standard, but now that CXL is set to be the dominant device interconnect protocol for servers, its functionality needs to expand to accommodate more advanced devices and, eventually, broader use cases.

Starting at the feature level, the biggest news here is that the standard has updated its cache coherency protocol for devices with memory (Type-2 and Type-3, in CXL parlance). Enhanced coherency, as CXL calls it, allows a device to back invalidate data that is being cached by the host. This replaces the bias-based coherency approach used in earlier versions of CXL, which, rather than managing the shared memory space itself, kept either the host or the device in control of access. Back invalidation, in contrast, is much closer to a true shared/symmetric approach, letting a CXL device notify the host when the device has made a change.

The inclusion of back invalidation also opens the door to new peer-to-peer connectivity between devices. In CXL 3.0, devices can use the enhanced coherency semantics to communicate their state to each other and access each other's memory directly, without involving the host. Skipping the host is not only faster in terms of latency, but in setups with switches it also means devices are not consuming precious host-to-switch bandwidth with their requests. Topologies are discussed later, but these changes go hand in hand with the larger topologies, which organize devices into virtual hierarchies so that all the devices in a hierarchy can share a coherency domain.
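To make the back-invalidation idea more concrete, here is a deliberately simplified conceptual model in Python. This is not the actual CXL protocol or its message formats; the HostCache and MemoryDevice classes are invented purely to illustrate the flow of a device telling hosts to drop stale cached copies when memory changes underneath them.

class HostCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                      # address -> cached value
    def read(self, device, addr):
        if addr not in self.lines:           # miss: fetch a coherent copy from the device
            self.lines[addr] = device.read(addr, requester=self)
        return self.lines[addr]
    def invalidate(self, addr):              # "back invalidation" arriving from the device
        self.lines.pop(addr, None)

class MemoryDevice:
    def __init__(self):
        self.memory = {}
        self.sharers = {}                    # address -> set of host caches holding a copy
    def read(self, addr, requester):
        self.sharers.setdefault(addr, set()).add(requester)
        return self.memory.get(addr, 0)
    def write(self, addr, value):
        # The device updates its memory, then tells every host that cached the
        # line to drop its now-stale copy, keeping everyone coherent.
        self.memory[addr] = value
        for host in self.sharers.pop(addr, set()):
            host.invalidate(addr)

host_a, host_b = HostCache("host-a"), HostCache("host-b")
dev = MemoryDevice()
dev.write(0x1000, 42)
assert host_a.read(dev, 0x1000) == host_b.read(dev, 0x1000) == 42
dev.write(0x1000, 99)                        # triggers back invalidation in both hosts
assert host_a.read(dev, 0x1000) == 99        # the next read re-fetches the fresh value

In the peer-to-peer case described above, the write would come from another device rather than the memory device itself, but the same invalidation mechanism keeps any host-cached copies coherent.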

Along with adjusting its caching capabilities, CXL 3.0 also introduces some important updates to memory sharing between hosts and devices. Whereas CXL 2.0 offered memory pooling, where multiple hosts could access a device's memory but each had to be assigned its own dedicated memory segment, CXL 3.0 introduces true memory sharing. Leveraging the new enhanced coherency semantics, multiple hosts can hold a coherent copy of a shared segment, with back invalidation used to keep all of the hosts in sync if something changes at the device level.

Note, however, that this is not a full replacement for pooling. There are still use cases where CXL 2.0-style pooling is preferable (maintaining coherency comes with trade-offs), and CXL 3.0 supports mixing and matching the two modes as needed.
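The difference between the two memory modes can be sketched in a few lines. The allocator below is purely illustrative (the function names, sizes, and host labels are made up), but it captures the distinction between carving a device into private slices and mapping one coherent segment to every host.

def pool_allocate(capacity_gb, hosts):
    # CXL 2.0-style pooling: each host gets its own private, non-overlapping slice
    slice_gb = capacity_gb // len(hosts)
    return {h: (i * slice_gb, (i + 1) * slice_gb) for i, h in enumerate(hosts)}

def share_segment(segment_gb_range, hosts):
    # CXL 3.0-style sharing: every host maps the same segment; enhanced coherency
    # and back invalidation keep their cached copies in sync
    return {h: segment_gb_range for h in hosts}

hosts = ["host-a", "host-b", "host-c"]
print(pool_allocate(384, hosts))       # {'host-a': (0, 128), 'host-b': (128, 256), 'host-c': (256, 384)}
print(share_segment((0, 64), hosts))   # every host sees the same (0, 64) segment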

Further building on these improved host-device capabilities, CXL 3.0 removes the previous limit on the number of Type-1/Type-2 devices that can be attached downstream of a single CXL root port. Whereas CXL 2.0 allowed only one of these processing devices to sit downstream of a root port, CXL 3.0 lifts the restriction entirely. A CXL root port can now support a full mix-and-match setup of Type-1/2/3 devices, depending on the system builder's goals. Among other things, this means that multiple accelerators can be attached to a single switch, improving density (more accelerators per host) and making the new peer-to-peer transfer features that much more useful.

The other major feature change in CXL 3.0 is support for multi-level switching. This builds on CXL 2.0, which introduced support for CXL protocol switches but allowed only a single switch between a host and its devices. Multi-level switching, by contrast, allows for multiple layers of switches (that is, switches feeding into other switches), which greatly increases the variety and complexity of supported network topologies.

Even with just two layers of switches, this is flexible enough to enable non-tree topologies such as rings, meshes, and other fabric setups. And the individual nodes can be hosts or devices, with no restrictions on type.

For truly exotic setups, meanwhile, CXL 3.0 can also support spine/leaf architectures, where traffic is routed through top-level spine nodes whose only job is to route traffic further down to the lower-level (leaf) nodes that contain the actual hosts and devices.
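For illustration only, a two-tier spine/leaf fabric can be described as a small adjacency structure like the one below; the node names are invented and no actual CXL management API is implied.

# Leaf switches hold the actual hosts and devices; spine switches only route.
leaves = {
    "leaf-0": ["host-0", "accel-0", "mem-0"],
    "leaf-1": ["host-1", "accel-1", "mem-1"],
}
spines = ["spine-0", "spine-1"]

# Every leaf uplinks to every spine, so any endpoint can reach any other in at
# most a leaf -> spine -> leaf traversal.
uplinks = [(leaf, spine) for leaf in leaves for spine in spines]
print(uplinks)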

Finally, all of these new memory and topology/fabric features can be used together in what the CXL consortium calls Global Fabric Attached Memory (GFAM). Simply put, GFAM takes the idea of CXL's memory expansion boards (Type-3) to the next level by further isolating memory from specific hosts. In that respect, the GFAM device is functionally its own shared memory pool that hosts and devices can access as needed. GFAM devices can also include both volatile and non-volatile memory together, such as DRAM and flash memory.

GFAM, in turn, is what enables CXL to efficiently support large, multi-node setups. As one of the consortium's examples illustrates, GFAM allows CXL 3.0 to deliver the performance and efficiency needed to implement MapReduce across a cluster of CXL-attached machines. MapReduce is, of course, a very popular algorithm for use with accelerators, so extending CXL to better handle a workload common to clustered accelerators is an obvious (and arguably necessary) next step for the standard. It does, however, blur the line a bit between where a local interconnect such as CXL ends and where a network interconnect such as InfiniBand begins.

Ultimately, the biggest differentiator is the number of nodes supported. CXL's addressing mechanism, which the consortium calls Port Based Routing (PBR), supports up to 2^12 (4096) devices, so a CXL setup can only scale out so far, especially as accelerators, attached memory, and other devices quickly eat up ports.
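A trivial bit of arithmetic shows how quickly that 12-bit ID space can be consumed; the per-chassis device counts below are hypothetical, chosen only to illustrate the scaling ceiling.

MAX_PBR_DEVICES = 2 ** 12              # 4096 addressable devices under Port Based Routing
per_chassis = 8 + 16                   # e.g. 8 accelerators + 16 memory expanders (hypothetical mix)
print(MAX_PBR_DEVICES)                 # 4096
print(MAX_PBR_DEVICES // per_chassis)  # ~170 such chassis before the ID space is exhausted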

Wrapping things up, the completed CXL 3.0 standard is being released to the public today, the first day of FMS 2022. Officially, the consortium does not offer any guidance on when to expect CXL 3.0 to show up in devices (that is up to the device manufacturers), but it is reasonable to say it won't be right away. With CXL 1.1 hosts only just now shipping (never mind CXL 2.0 hosts), the actual productization of CXL trails the standards by a couple of years, which is typical of these major industry interconnect standards.
