"Hey Quadro, bring me some water" - Apple unified memory explained
A simple guide to what it is, how it works, and what the benefits are for us professionals. Spoiler: it's a big enough deal to diss Nvidia!
With the announcement of Apple's new M1 SoC lineup, aimed squarely at professionals, it's worth discussing what makes these puppies so powerful, even in their relative infancy. I'll try to break it down in the least technical way possible, so everybody can understand the benefits.
What is an SoC and why does it matter?
SoC stands for system-on-chip, and until recently the term described mostly smartphone processors. The idea behind an SoC is the integration of major components - CPU, GPU, RAM, I/O controllers, video encoders and decoders, hardware security, and machine learning (AI) accelerators - on a single chip, or even vertically stacked within a single package.
Such close integration allows those modules to communicate faster and more efficiently with each other than if they were separate components on a motherboard. This is the major benefit of the mobile ARM architecture, along with its energy efficiency. Apple has been bragging for years about how its A-series chips have overtaken PC CPUs in performance per watt, and the M1 series is now the living proof.
Access and bandwidth
But the key word for an ARM chip in a laptop/desktop is access! In a classic PC, where every component lives in a separate motherboard slot, access is the major bottleneck: CPU access to memory, CPU access to GPU memory, GPU access to CPU memory, and vice versa. It all needs bandwidth!
That's why things like Infinity Fabric, PCI Express lanes and dedicated GPU memory exist in the first place - and their speed is a major limiting factor when you try to tap into some extra performance. Worse still, PCI Express lanes are also needed for the current I/O set, so it's not uncommon for your ports to suffer from those limits.
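For a sense of scale, here's the back-of-the-envelope math on what a PCIe link can actually move. The per-lane figures below are approximate theoretical numbers I'm assuming for illustration, not measured results:

```python
# Approximate usable throughput per lane, per direction (my assumptions):
PCIE3_PER_LANE_GBS = 0.985   # PCIe 3.0, after 128b/130b encoding overhead
PCIE4_PER_LANE_GBS = 1.969   # PCIe 4.0 roughly doubles it

def pcie_bandwidth_gbs(lanes, per_lane_gbs):
    """Theoretical one-direction throughput of a PCIe link, in GB/s."""
    return lanes * per_lane_gbs

print(f"PCIe 3.0 x16: ~{pcie_bandwidth_gbs(16, PCIE3_PER_LANE_GBS):.1f} GB/s")
print(f"PCIe 4.0 x16: ~{pcie_bandwidth_gbs(16, PCIE4_PER_LANE_GBS):.1f} GB/s")
```

A full x16 slot tops out around 15.8 GB/s on PCIe 3.0 and 31.5 GB/s on PCIe 4.0 - and your GPU, capture card and fast SSDs all compete for those lanes.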
The greatest roommate
While the memory of the new M1 Pro/Max is a separate entity, sitting just next to the SoC's silicon die, it shares the same package (and power delivery), so access is immediate no matter which block needs the data. No need for anything fancy or technical - the unified memory is within reach of every module and can move data back and forth much faster.
Remember those pesky RAM sticks in your desktop PC? The fastest out-of-the-box speed for DDR4 is around 35.2 GB/s per channel for a super high-end kit (DDR4-4400), or roughly 70.4 GB/s in dual channel. Compare that to 200 GB/s for the M1 Pro and 400 GB/s for the M1 Max! It's a bloodbath - there's no other word for it...
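Those DDR numbers aren't magic - they fall straight out of the spec. A DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is just the transfer rate times eight bytes, times the number of channels. A quick sketch:

```python
def ddr_bandwidth_gbs(megatransfers_per_sec, bus_width_bits=64, channels=1):
    """Peak DDR throughput in GB/s: transfers/s x bytes per transfer x channels."""
    bytes_per_transfer = bus_width_bits / 8
    return megatransfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

print(ddr_bandwidth_gbs(4400))              # DDR4-4400, single channel -> 35.2
print(ddr_bandwidth_gbs(4400, channels=2))  # dual channel -> 70.4
print(ddr_bandwidth_gbs(3200, channels=2))  # a typical DDR4-3200 pair -> 51.2
```

Even a doubled-up, top-shelf DDR4 kit sits well under the M1 Pro's 200 GB/s, let alone the Max's 400 GB/s.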
While we wait for DDR version 42 to catch up eventually, things get even more impressive when it comes to GPU access. Let's start with the PC world first. Imagine how a video edit runs. The CPU first receives all the instructions from the software and then offloads the data the GPU needs to the graphics card. The graphics card then works on all that data with its own processor (the GPU) and built-in video memory.
Even if you have a processor with integrated graphics, the GPU typically maintains its own chunk of memory, as does the processor. They each work on the same data independently and then shuffle the results back and forth between their separately occupied memory. It's a process that has to happen many times per second, and both bandwidth and access latency are huge limiting factors.
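To see why those round trips hurt, here's a toy model (my own illustrative numbers, not Apple's or Nvidia's) of moving one uncompressed 8K frame across a PCIe 3.0 x16 link:

```python
def frame_copy_time_ms(width, height, bytes_per_pixel, link_gbs):
    """Time to push one uncompressed frame across a link, in milliseconds."""
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes / (link_gbs * 1e9) * 1000

# One 8K RGBA frame (7680 x 4320, 4 bytes/pixel) over ~15.8 GB/s of PCIe 3.0 x16:
per_copy_ms = frame_copy_time_ms(7680, 4320, 4, 15.8)   # roughly 8.4 ms per crossing

# If every operation sends the frame to the GPU and pulls the result back,
# a 60 fps timeline spends this many milliseconds per second just on copies:
copy_budget_ms = 2 * 60 * per_copy_ms                   # over 1000 ms - more than a second!
```

In other words, under these assumptions the copies alone can't even fit into real time - and that's exactly the traffic that unified memory removes.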
Dreams do come true!
If you drop the requirement to move data back and forth, it's easy to see how keeping everything in the same virtual engine bay improves performance. This is the mastery of the ARM architecture and unified memory: both the CPU and the GPU have equal access to the whole memory pool, be it in size or in speed. And since both work on the same physical memory, there's no need to copy data back and forth, and no need to juggle duplicates across separate memory pools.
In an occupied chunk of memory, the CPU and the GPU can read the same data and share access with the other modules as well - say, the ProRes video encoder. And you can't go more Pro than that! Take Nvidia's Quadro professional graphics card series, for example. The fastest samples out there are roughly on par with those 400 GB/s of memory bandwidth, but they top out at 24 GB of memory. Two of them will get you to 48 GB, but then the access between the GPUs and their memories gets even more complicated.
If you're a creative professional (and I have no doubt that you are), at some point in your workflow you will need a lot of video memory, because working with high resolutions and ever higher bitrates is getting seriously demanding. We've come to the point where a mobile architecture from Apple can put two Quadro cards to shame, offering access to 64 GB of video memory at comparable bandwidth, with less latency and simpler GPU-to-CPU communication. Your Maxon Cinema 4D project can finally be completed at an unprecedented level of graphical detail, without you having to make any compromises.