It’s not a real problem for a system like this. The system uses CXL. Their rant mostly comes from not taking the time to dig into what the specs actually are.
The system uses the CXL/AMBA CHI specs under NVLink-C2C, which means the memory is linked directly to the GPU as well as to the CPU.
In that case all of their complaints are pretty unfounded, and they’d have to rewrite any concerns to take those specs into account.
Check https://www.nvidia.com/en-us/project-digits/, which is where I did my next-level dive on this.
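If you want to sanity-check the coherence claim yourself, the stock CUDA runtime already exposes it: on a coherently linked CPU+GPU (the Grace-style C2C setups), the GPU reports that it can touch plain pageable host allocations and even share the host page tables. Rough sketch, assuming a standard CUDA toolkit; the attribute values are what I’d expect on this class of hardware, not something I’ve run on this box:

```c
/* Query whether the GPU can coherently access ordinary CPU allocations.
 * Build: gcc coherence_check.c -lcudart (assumes the CUDA toolkit is installed). */
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int dev = 0, pageable = 0, host_tables = 0;

    /* 1 = kernels can dereference malloc'd (pageable) host memory directly. */
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, dev);
    /* 1 = the GPU walks the same page tables as the CPU, i.e. full HW coherence. */
    cudaDeviceGetAttribute(&host_tables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);

    printf("GPU can access pageable host memory: %d\n", pageable);
    printf("GPU shares host page tables:         %d\n", host_tables);

    /* On a coherent C2C system this pointer could be handed straight to a
     * kernel; on a plain PCIe box it would have to be pinned or copied first. */
    float *buf = malloc(1024 * sizeof(float));
    buf[0] = 1.0f;
    free(buf);
    return 0;
}
```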
EDIT: This all assumes they’re talking about the bandwidth cost of treating every allocation as CPU-owned memory, rather than using concepts like LikelyShared vs Unique.
You make a lot of good points in here, but I think you’re slightly off on a couple of key ones.
First, these are Arm, not x64, so they use SVE2, which can technically scale to 2048-bit vectors versus AVX-512’s 512 bits. Whether they actually scaled it that far, I’m unsure; existing Grace products use 4x128-bit units, so possibly not.
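For anyone who hasn’t touched SVE: it’s vector-length agnostic, so the same binary runs on anything from a 128-bit to a 2048-bit implementation and just picks up the width at runtime. Quick sketch with plain ACLE intrinsics, nothing Grace-specific, my own example:

```c
/* Vector-length-agnostic SAXPY with SVE intrinsics.
 * Build on an Arm box: gcc -O2 -march=armv8-a+sve saxpy_sve.c */
#include <arm_sve.h>
#include <stdint.h>
#include <stdio.h>

void saxpy_sve(float a, const float *x, float *y, int64_t n) {
    /* svcntw() = how many 32-bit lanes this CPU's vectors hold (4 at 128-bit,
     * 64 at 2048-bit); the loop never hard-codes a width. */
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);        /* predicate masks the tail */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        svst1_f32(pg, &y[i], svmla_n_f32_x(pg, vy, vx, a));  /* y += a * x */
    }
}

int main(void) {
    printf("This CPU's SVE vector length: %lu bits\n", svcntb() * 8);
    return 0;
}
```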
Second, this isn’t meant to be a performance device; it’s meant to be a capable device. You can’t easily build a computer that can handle the compute complexity this device can take on for local AI iteration. You wouldn’t deploy with this as the backend; it’s a dev box.
Third, the CXL and CHI specs cover memory scoped outside the bounds of what the host can cache. That memory might not be directly accessible to the CPU, but there are a few ways they could optimize around that. The fact that they have an all-in-one custom solution means they can hack in some workarounds to execute the complex workloads.
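One software-visible knob for that kind of thing (assuming they expose the standard unified-memory path rather than something more exotic): you can tell the runtime to keep a managed region resident GPU-side so the CPU only reaches over for it on demand instead of keeping it in its coherent working set. Sketch against the stock CUDA runtime, not anything DIGITS-specific:

```c
/* Keep a big managed buffer resident GPU-side; the CPU can still reach it,
 * it just isn't the preferred home for the pages.
 * Build: gcc gpu_resident.c -lcudart */
#include <cuda_runtime_api.h>
#include <stdio.h>

int main(void) {
    size_t bytes = (size_t)1 << 30;   /* 1 GiB, illustrative size */
    void *weights = NULL;
    int dev = 0;

    cudaMallocManaged(&weights, bytes, cudaMemAttachGlobal);

    /* Prefer the GPU as the physical home for these pages... */
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetPreferredLocation, dev);
    /* ...but keep a CPU mapping so host code can still read them when needed. */
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    printf("managed buffer at %p, GPU-preferred\n", weights);
    cudaFree(weights);
    return 0;
}
```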
I’d want to see how this performs versus an i9 + 5090 workstation, but even that already goes beyond this device’s price point. Currently a 4090 can handle ~20B params, which is an order of magnitude smaller than what this can handle.
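Back-of-envelope on why those numbers line up (my quantization assumptions, not Nvidia’s): 20B params at 8-bit is ~20 GB of weights, which is about what a 24 GB 4090 can hold once you leave room for the KV cache; 200B params at 4-bit is ~100 GB, which is why the 128 GB of unified memory on the product page maps to their ~200B-parameter claim.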