Mac M3 vs. 2 x NVIDIA 4090: A Performance Showdown in Llama 3 Inference
Results of a Quick Stress Test: Llama 3 8B and 70B on a Mac M3 (128GB RAM) vs. 2 x NVIDIA 4090s.
The tests were performed using the Ollama inference engine.
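For reference, both metrics can be read straight off Ollama's streaming API. Below is a minimal Python sketch, not the exact script behind these numbers: it assumes Ollama is serving on its default localhost:11434 port, it uses the third-party requests library, and the model tag and prompt are only placeholders. Time to first token is taken as the delay until the first streamed chunk, and tokens per second comes from the eval_count and eval_duration stats in the final streamed response.

import json
import time
import requests

def benchmark(model, prompt):
    # Start the clock before the request so model load and prompt
    # processing are included in time to first token.
    start = time.perf_counter()
    first_token_at = None

    # /api/generate streams one JSON object per line; the final object
    # (done=True) carries aggregate stats such as eval_count and
    # eval_duration (in nanoseconds).
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first_token_at is None and chunk.get("response"):
                first_token_at = time.perf_counter()
            if chunk.get("done"):
                tokens_per_sec = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
                print(f"{model}: time to first token {first_token_at - start:.2f}s, "
                      f"{tokens_per_sec:.2f} tokens/s")

benchmark("llama3", "Explain the difference between unified memory and VRAM.")

The same function can be pointed at the llama3:70b tag for the larger model.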
Llama 3 8B Performance
Mac M3:
Time to First Token: 0.90 seconds
Tokens per Second: 58.17
2 x NVIDIA 4090:
Time to First Token: 0.18 seconds
Tokens per Second: 141.48
Running Llama 3 70B on the M3 is painfully slow, but it can still be run on the 4090s given enough memory.
Llama 3 70B Performance
2 x NVIDIA 4090:
Time to First Token: 0.10 seconds
Tokens per Second: 20.97
To run the 8B model, you need approximately 16GB of memory, which fits comfortably on a single 4090. The 70B model requires over 140GB of memory, more than two 4090s can hold in VRAM. However, thanks to the fast PCIe interface on the 4090 rig and the 128GB of unified memory on the M3, I was able to run it by splitting the model across GPU and main memory.
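Those memory figures follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per weight (2 bytes at FP16), plus overhead for the KV cache and activations. A quick sketch of that arithmetic (FP16 weights assumed, overhead ignored):

# Weight memory at FP16: 2 bytes per parameter.
# KV cache and runtime overhead are not included, so real usage is higher.
def fp16_weight_gb(params_billions):
    return params_billions * 1e9 * 2 / 1e9  # bytes -> GB

print(f"8B  -> ~{fp16_weight_gb(8):.0f} GB (fits in a single 4090's 24 GB)")
print(f"70B -> ~{fp16_weight_gb(70):.0f} GB (exceeds the 48 GB of two 4090s)")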
The M3 uses unified memory with much lower bandwidth (~200GB/s) than the 4090's (~1,008GB/s per GPU), so the higher time to first token on the M3 is understandable. The M3 also delivers less than 10 TFLOPS, compared to ~80 TFLOPS (or even ~160 TFLOPS at FP16/INT8) for the 4090. Interestingly, the measured throughput gap is less than 3x (141.48 vs. 58.17 tokens per second), which is surprising given those gaps in raw specs.
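Putting the measured and quoted figures side by side makes the gap easier to see; the snippet below only computes ratios from the numbers in this post (the spec values are the approximate ones quoted above). One plausible explanation for the modest throughput gap is that single-stream token generation is usually limited by memory bandwidth rather than raw compute, so the TFLOPS difference overstates the expected speedup, but confirming that would take deeper profiling.

# Ratios derived from the measurements and approximate specs quoted above.
m3_ttft, m3_tps = 0.90, 58.17
gpu_ttft, gpu_tps = 0.18, 141.48
m3_bw, gpu_bw = 200, 1008        # GB/s, as quoted above
m3_tflops, gpu_tflops = 10, 80   # FP32, approximate

print(f"Time-to-first-token ratio (M3 / 4090s): {m3_ttft / gpu_ttft:.1f}x")     # ~5.0x
print(f"Throughput ratio (4090s / M3):          {gpu_tps / m3_tps:.1f}x")       # ~2.4x
print(f"Memory bandwidth ratio (per GPU):       {gpu_bw / m3_bw:.1f}x")         # ~5.0x
print(f"FP32 compute ratio:                     {gpu_tflops / m3_tflops:.1f}x") # ~8.0x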
Perhaps in the near future, building your own rig won't be necessary to achieve the best LLM performance.
(This is a quick-and-dirty comparison; a more in-depth understanding of Ollama and its memory utilization is needed to validate the observations here.)