Hot Chips 2020 Live Blog: Marvell ThunderX3 (10:30am PT)

AnandTech Live Blog: The newest updates are at the top.

01:59PM EDT – Q: No SVE? A: Didn’t line up with dev schedule. Better fit for next gen TX. Coming then

01:58PM EDT – Q: Who manufactures it? A: TSMC 7nm

01:57PM EDT – Q&A time

01:57PM EDT – Snoop based coherence

01:56PM EDT – non-inclusive

01:56PM EDT – Logically shared but structurally distributed

01:56PM EDT – Ring cache with a column

01:56PM EDT – (that’s ~1.5x ?)

01:55PM EDT – MySQL: 1 thread to 60 cores and 240 threads offers 89x perf improvement
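A rough check of the ~1.5x remark above, using only the figures from the slide (89x speedup, 60 cores, 240 threads). Attributing everything beyond 60x to SMT assumes perfectly linear scaling across cores, which the talk did not claim; the function name is made up for illustration.

```python
# Back-of-envelope split of the MySQL scaling figure quoted in the talk.
# Assumption (not from the slides): scaling across cores is perfectly linear,
# so anything beyond 60x is attributed to SMT.

def scaling_breakdown(total_speedup: float, cores: int, threads: int) -> dict:
    """Return the speedup normalized per core and per hardware thread."""
    return {
        "per_core": total_speedup / cores,
        "per_thread": total_speedup / threads,
    }

result = scaling_breakdown(total_speedup=89, cores=60, threads=240)
print(f"speedup per core:   {result['per_core']:.2f}x")
print(f"speedup per thread: {result['per_thread']:.2f}x")
```

That works out to about 1.48x per core, which is where the ~1.5x comes from.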

01:55PM EDT – (these are SPEC numbers)

01:55PM EDT – up to 2.21x going to SMT4 for MySQL, 1.28x for x264
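To put the SMT4 numbers above in per-thread terms, here is a small sketch. It assumes each quoted speedup is measured against a single thread running alone on the same core, which the slide did not spell out; the names are illustrative.

```python
# Per-thread throughput implied by the SMT4 speedups quoted in the talk.
# Assumption: each speedup is relative to one thread running alone on a core.

SMT_WAYS = 4
smt4_speedup = {"MySQL": 2.21, "x264": 1.28}

def per_thread_share(speedup: float, ways: int = SMT_WAYS) -> float:
    """Fraction of single-thread throughput each thread gets under SMT."""
    return speedup / ways

for workload, speedup in smt4_speedup.items():
    share = per_thread_share(speedup)
    print(f"{workload}: {speedup:.2f}x per core, "
          f"{share:.0%} of single-thread speed per thread")
```

Under that assumption each MySQL thread runs at roughly 55% of solo speed and each x264 thread at roughly 32%, so the core-level gain comes from running four slower threads at once.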

01:54PM EDT – Here’s the thread speedup

01:53PM EDT – Arbitration to ensure high utilization

01:53PM EDT – die area impact of SMT4 is ~5%

01:52PM EDT – so four CPUs per core

01:52PM EDT – As far as OS is concerned, each thread is a full CPU
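A quick sketch of what the two entries above mean for the OS-visible CPU count, using the 60 and 96 core counts given in the talk. The helper name is hypothetical; `os.cpu_count()` is the standard-library call that reports logical CPUs on the machine running the script.

```python
import os

SMT_WAYS = 4  # SMT4, per the talk

def logical_cpus(cores: int, smt_ways: int = SMT_WAYS) -> int:
    """Logical CPUs the OS would see if every hardware thread is exposed."""
    return cores * smt_ways

for cores in (60, 96):
    print(f"{cores} cores -> {logical_cpus(cores)} logical CPUs")

# For comparison, the logical CPU count on whatever machine runs this script.
print("this machine:", os.cpu_count())
```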

01:52PM EDT – Gains are actually better – these slides are a couple weeks old

01:52PM EDT – FP gets slightly better gain than Int

01:51PM EDT – SPECint 30% from arch, rest is from frequency increase

01:51PM EDT – Early perf measurements on TX3 silicon

01:51PM EDT – larger OoO helps 5%, reduced micro-op expansion helps 6%

01:50PM EDT – Here are all the improvements over TX2 for each change

01:50PM EDT – L2 prefetcher supports strides and regions

01:50PM EDT – L1 TLB allows 0-cycle access

01:49PM EDT – Scheduler – 256 entry depths of ports

01:49PM EDT – rename tries to bundle microops

01:49PM EDT – Each thread has a 32-micro-op skid buffer, supports 8x four micro-op bundles

01:48PM EDT – performance uplift on datacenter codes

01:48PM EDT – this allows the front end to walk the path and fetch cache lines

01:48PM EDT – decoupled fetch (TX2 did not have this)

01:47PM EDT – 64 KB I-cache

01:47PM EDT – 512 KB 8-way unified L2

01:47PM EDT – 2 are load+store, 1 is store

01:47PM EDT – 7 execution ports, 4 are HP, 3 are Load/Store

01:47PM EDT – 70 entry scheduler, 220 entry ROB

01:46PM EDT – 4 ops/cycle dispatch to scheduler

01:46PM EDT – Skid buffer is where the loop buffer is located

01:46PM EDT – most other structures are shared

01:46PM EDT – Skid buffer – private to each thread

01:45PM EDT – Most instructions map to a single micro-op

01:45PM EDT – 8x decode

01:45PM EDT – 64KB L1-I, up to 8 instructions/cycle

01:45PM EDT – Core block diagram

01:44PM EDT – SMT4

01:44PM EDT – TSMC 7nm

01:44PM EDT – 2x-3x perf over TX2 in SPEC

01:44PM EDT – Full IO, SATA, USB

01:44PM EDT – On-die monitoring and power management subsystem

01:44PM EDT – 8x DDR4-3200, 64 PCIe 4.0 lanes
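The memory and PCIe figures above translate into theoretical peak bandwidth roughly as follows. This sketch assumes "8x" means eight 64-bit DDR4 channels, and uses the standard PCIe 4.0 rate of 16 GT/s per lane with 128b/130b encoding; neither detail was stated in the talk, and sustained bandwidth will be lower.

```python
# Theoretical peak bandwidth from the headline memory and I/O figures.
# Assumptions (not stated in the talk): eight 64-bit DDR4 channels;
# PCIe 4.0 at 16 GT/s per lane with 128b/130b line encoding.

def ddr4_peak_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s (decimal) across all channels."""
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9

def pcie4_peak_gbs(lanes: int) -> float:
    """Peak PCIe 4.0 bandwidth in GB/s per direction, after encoding overhead."""
    per_lane = 16e9 * (128 / 130) / 8 / 1e9
    return lanes * per_lane

print(f"DDR4-3200 x8 channels: {ddr4_peak_gbs(8, 3200):.1f} GB/s")
print(f"PCIe 4.0 x64 lanes:    {pcie4_peak_gbs(64):.1f} GB/s per direction")
```

That gives about 204.8 GB/s of DRAM bandwidth and about 126 GB/s of PCIe bandwidth in each direction.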

01:44PM EDT – Up to 60 cores (single die) or 96 cores (dual die), Arm v8.3 with some v8.4 and v8.5 features

01:43PM EDT – ThunderX4 in the works

01:43PM EDT – Two versions of TX3 – single die and dual die

01:43PM EDT – Lots of learnings went into TX3

01:42PM EDT – Industry leading perf on bandwidth intensive workloads when launched

01:42PM EDT – Paved the way for a number of Arm Server CPU firsts

01:42PM EDT – Recap of TX2, a 32-core Arm v8.1 design with SMT4

01:41PM EDT – Sell into the same market as Intel and AMD

01:41PM EDT – Rabin Sugumar is lead architect for TX3

01:41PM EDT – ThunderX3 is now a Marvell product; Marvell acquired Cavium, which built TX and TX2