Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000

One of the stories around AMD’s initial generations of Zen processors was the effect of Simultaneous Multi-Threading (SMT) on performance. By running with this mode enabled, as is default in most situations, users saw significant performance rises in situations that could take advantage. The reasons for this performance increase come down to two competing explanations: either the core is designed in such a way that a single thread leaves it underutilized, or the SMT strategy is constructed efficiently enough to extract extra performance. In this review, we take a look at AMD’s latest Zen 3 architecture to observe the benefits of SMT.

What is Simultaneous Multi-Threading (SMT)?

We often consider each CPU core as being able to process one stream of serial instructions for whatever program is being run. Simultaneous Multi-Threading, or SMT, enables a processor to run two concurrent streams of instructions on the same processor core, sharing resources and optimizing potential downtime on one set of instructions by having a secondary set come in and take advantage of the underutilization. Two of the limiting factors in most computing models are compute and memory latency, and SMT is designed to interleave sets of instructions to optimize compute throughput while hiding memory latency.

An old slide from Intel, which has its own marketing term for SMT: Hyper-Threading

When SMT is enabled, depending on the processor, it will allow two, four, or eight threads to run on that core (we have seen some esoteric compute-in-memory solutions with 24 threads per core). Instructions from any thread are rearranged to be processed in the same cycle and keep utilization of the core resources high. Because multiple threads are used, this is known as extracting thread-level parallelism (TLP) from a workload, whereas a single thread with instructions that can run concurrently is instruction-level parallelism (ILP).

Is SMT A Good Thing?

It depends on who you ask.

SMT2 (two threads per core) involves creating core structures sufficient to hold and manage two instruction streams, as well as managing how those core structures share resources. For example, suppose one particular buffer in your core design is meant to handle up to 64 instructions in a queue. If the average occupancy is lower than that (such as 40), then the buffer is underutilized, and an SMT design will keep the buffer fed closer to the top on average. That buffer might be increased to 96 instructions in the design to account for this, ensuring that if both instruction streams are running at an ‘average’, then both will have sufficient headroom. This means two threads worth of use, for only 1.5 times the buffer size. If all else works out, then it is double the performance for less than double the core design in design area. But in ST mode, where that 96-entry buffer is on average only about 42% filled (40 of 96 entries), the whole buffer still has to be powered on all the time, so it might be wasting power.
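
The buffer arithmetic above can be sketched in a few lines. The 64, 40 and 96 entry figures are the illustrative numbers from the paragraph, not real Zen 3 structure sizes, and the function name is made up for this example.

```python
def buffer_utilization(avg_entries_per_thread: float, threads: int, capacity: int) -> float:
    """Average fraction of a queue that is occupied."""
    return avg_entries_per_thread * threads / capacity


# Single-thread design: 64 entries, 40 in use on average.
st_design = buffer_utilization(40, threads=1, capacity=64)

# SMT2 design grown to 96 entries, both threads running at their average.
smt2_two_threads = buffer_utilization(40, threads=2, capacity=96)

# The same 96-entry design with only one thread active.
smt2_one_thread = buffer_utilization(40, threads=1, capacity=96)

print(f"ST design, 1 thread:    {st_design:.1%}")
print(f"SMT2 design, 2 threads: {smt2_two_threads:.1%}")
print(f"SMT2 design, 1 thread:  {smt2_one_thread:.1%}")
```

The point of the sketch is the trade-off: the larger buffer is better filled with two threads than the original was with one, but worse filled whenever only one thread is running.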

But, if a core design benefits from SMT, then perhaps the core hasn’t been designed optimally for a single thread of performance in the first place. If enabling SMT gives a user exact double performance and perfect scaling across the board, as if there were two cores, then perhaps there is a direct issue with how the core is designed, from execution units to buffers to cache hierarchy. It has been known for users to complain that they only get a 5-10% gain in performance with SMT enabled, stating it doesn’t work properly – this could just be because the core is designed better for ST. Similarly, stating that a +70% performance gain means that SMT is working well could be more of a signal of an unbalanced core design that wastes power.

This is the dichotomy of Simultaneous Multi-Threading. If it works well, then a user gets extra performance. But if it works too well, perhaps this is indicative of a core not suited to a particular workload. The answer to the question ‘Is SMT a good thing?’ is more complicated than it appears at first glance.

We can split up the systems that use SMT:

  • High-performance x86 from Intel
  • High-performance x86 from AMD
  • High-performance POWER/z from IBM
  • Some High-Performance Arm-based designs
  • High-Performance Compute-In-Memory Designs
  • High-Performance AI Hardware

Comparing to those that do not:

  • High-efficiency x86 from Intel
  • All smartphone-class Arm processors
  • Successful High-Performance Arm-based designs
  • Highly focused HPC workloads on x86 with compute bottlenecks

(Note that Intel calls its SMT implementation ‘Hyper-Threading’, which is a marketing term specifically for Intel).

At this point, we’ve only been discussing SMT where we have two threads per core, known as SMT2. Some of the more esoteric hardware designs go beyond two threads-per-core based SMT, and use up to eight. You will see this stylized in documentation as SMT8, compared to SMT2 or SMT4. This is how IBM approaches some of its designs. Some compute-in-memory applications go as far as SMT24!

There is a clear trend between SMT-enabled systems and no-SMT systems, and that seems to be the marker of high-performance. The one exception to that is the recent Apple M1 processor and the Firestorm cores.

It should be noted that for systems that do support SMT, it can be disabled to force it down to one thread per core, to run in SMT1 mode. This has a few major benefits:

It enables each thread to have access to a full core worth of resources. In some workload situations, having two threads on the same core will mean sharing of resources, and cause additional unintended latency, which may be important for latency critical workloads where deterministic (the same) performance is required. It also reduces the number of threads competing for L3 capacity, should that be a limiting factor. Also, should any software be required to probe every other thread for data, for a 16-core processor like the 5950X that means only reaching out to 15 other threads rather than 31 other threads, reducing potential crosstalk limited by core-to-core connectivity.

The other aspect is power. With a single thread on a core and no other thread to jump in if resources are underutilized, when there is a delay caused by pulling something from main memory, then the power of the core would be lower, providing budget for other cores to ramp up in frequency. This is a bit of a double-edged sword if the core is still at a high voltage while waiting for data in an SMT disabled mode. SMT in this way can help improve performance per Watt, assuming that enabling SMT doesn’t cause competition for resources and arguably longer stalls waiting for data.

Mission critical enterprise workloads that require deterministic performance, and some HPC codes that require large amounts of memory per thread often disable SMT on their deployed systems. Consumer workloads are often not as critical (at least in terms of scale and $$$), and so the topic isn’t often covered in detail.

Most modern processors, when in SMT-enabled mode, if they are running a single instruction stream, will operate as if in SMT-off mode and have full access to resources. Some software takes advantage of this, spawning only one thread for each physical core on the system. Because core structures can be dynamically partitioned (adjusting resources for each thread while threads are in progress) or statically shared (adjusting before a workload starts), situations where the two threads on a core are creating their own bottleneck would benefit from having only a single thread per core active. Knowing how a workload uses a core can help when writing software designed to make use of multiple cores.
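
As a rough illustration of the one-thread-per-physical-core approach, here is a minimal sketch for Linux, where the kernel exposes each logical CPU's SMT siblings through sysfs. The helper names are hypothetical, and the parsing assumes the usual comma or range formats (such as 0,16 or 4-5). It is not how any particular application does it.

```python
import glob
import os


def parse_cpu_list(text: str) -> list:
    """Expand a Linux CPU list such as '0,16' or '4-5' into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(sibling_lists: list) -> list:
    """Pick the lowest-numbered logical CPU from each set of SMT siblings."""
    cores = {tuple(sorted(parse_cpu_list(s))) for s in sibling_lists}
    return sorted(group[0] for group in cores if group)


def read_sibling_lists() -> list:
    """Read sibling lists from sysfs; returns [] where the files are absent."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    primaries = one_cpu_per_core(read_sibling_lists())
    # Fall back to the logical CPU count when topology data is unavailable.
    workers = len(primaries) or os.cpu_count()
    print(f"Would spawn {workers} worker threads, one per physical core")
```

On Linux the resulting list could then be handed to os.sched_setaffinity to keep the workers off the sibling threads. Whether that helps depends on the workload, as the results later in this piece show.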

Here is an example of a Zen3 core, showing all the structures. One of the progress points with every new generation of hardware is to reduce the number of statically allocated structures within a core, as dynamic structures often give the best flexibility and peak performance. In the case of Zen3, only three structures are still statically partitioned: the store queue, the retire queue, and the micro-op queue. This is the same as Zen2.

 

SMT on AMD Zen3 and Ryzen 5000

So much like AMD’s previous Zen-based processors, the Ryzen 5000 series that uses Zen3 cores also has an SMT2 design. By default this is enabled in every consumer BIOS, however users can choose to disable it through the firmware options.

For this article, we have run our AMD Ryzen 9 5950X processor, a 16-core high-performance Zen3 processor, in both SMT Off and SMT On modes through our test suite and through some industry standard benchmarks. The goals of these tests are to ascertain the answers to the following questions:

  1. Is there a single-thread benefit to disabling SMT?
  2. How much performance increase does enabling SMT provide?
  3. Is there a change in performance per watt in enabling SMT?
  4. Does having SMT enabled result in a higher workload latency?*

*more important for enterprise/database/AI workloads

The best argument for enabling SMT would be a No-Lots-Yes-No result. Conversely the best argument against SMT would be a Yes-None-No-Yes. But because the core structures were built with having SMT enabled in mind, the answers are rarely that clear.

For our test suite, due to obtaining new 32 GB DDR4-3200 memory modules for Ryzen testing, we re-ran our standard test suite on the Ryzen 9 5950X with SMT On and SMT Off. As per our usual testing methodology, we test memory at official rated JEDEC specifications for each processor at hand.

Test Setup
Platform | AMD AM4
CPU | Ryzen 9 5950X
Motherboard | MSI X570 Godlike
BIOS | 1.B3T13 (AGESA 1100)
Cooler | Noctua NH-U12S
Memory | ADATA 4×32 GB DDR4-3200
GPU | Sapphire RX 460 2GB (CPU Tests), NVIDIA RTX 2080 Ti
PSU | OCZ 1250W Gold
SSD | Crucial MX500 2TB
OS | Windows 10 x64 1909, Spectre and Meltdown Patched
VRM | Supplemented with SilverStone SST-FHP141-VF 173 CFM fans

Also many thanks to the companies that have donated hardware for our test systems, including the following:

Hardware Providers for CPU and Motherboard Reviews

  • Sapphire RX 460 Nitro
  • NVIDIA RTX 2080 Ti
  • Crucial SSDs
  • Corsair PSUs
  • G.Skill DDR4
  • ADATA DDR4-3200 32GB
  • Silverstone Ancillaries
  • Noctua Coolers

For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
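
To make the percentage figures concrete, here is a minimal sketch of how such a differential can be computed. The function name is hypothetical, and the handling of time-to-complete tests (inverting the ratio so that a faster run scores above 100%) is an assumption about the convention rather than something stated in the methodology.

```python
def smt_on_percent(smt_on: float, smt_off: float, higher_is_better: bool = True) -> float:
    """SMT On result as a percentage of the SMT Off baseline (100 = no change)."""
    if higher_is_better:
        return 100.0 * smt_on / smt_off
    # For time-to-complete results a shorter run is the better one, so invert the ratio.
    return 100.0 * smt_off / smt_on


# 3DPMavx scores quoted in the power section: 10245 (SMT On) vs 5773 (SMT Off).
print(f"3DPMavx: {smt_on_percent(10245, 5773):.1f}%")
```

With those two scores the sketch reproduces the 177.5% shown for 3DPM with AVX2 in the multi-threaded results.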

Here are the single threaded results.

Single Threaded Tests
AMD Ryzen 9 5950X
AnandTech | SMT Off (Baseline) | SMT On
y-Cruncher | 100% | 99.5%
Dwarf Fortress | 100% | 99.9%
Dolphin 5.0 | 100% | 99.1%
CineBench R20 | 100% | 99.7%
Web Tests | 100% | 99.1%
GeekBench (4+5) | 100% | 100.8%
SPEC2006 | 100% | 101.2%
SPEC2017 | 100% | 99.2%

Interestingly enough our single threaded performance was within about a single percentage point across the stack (SPEC2006 being the largest swing at +1.2%). Given that ST mode should arguably give more resources to each thread for consistency, the fact that we see no difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests
AMD Ryzen 9 5950X
AnandTech | SMT Off (Baseline) | SMT On
Agisoft Photoscan | 100% | 98.2%
3D Particle Movement | 100% | 165.7%
3DPM with AVX2 | 100% | 177.5%
y-Cruncher | 100% | 94.5%
NAMD AVX2 | 100% | 106.6%
AIBench | 100% | 88.2%
Blender | 100% | 125.1%
Corona | 100% | 145.5%
POV-Ray | 100% | 115.4%
V-Ray | 100% | 126.0%
CineBench R20 | 100% | 118.6%
HandBrake 4K HEVC | 100% | 107.9%
7-Zip Combined | 100% | 133.9%
AES Crypto | 100% | 104.9%
WinRAR | 100% | 111.9%
GeekBench (4+5) | 100% | 109.3%

Here we have a number of different factors affecting the results.

We start with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound and compute-bound in parts, where the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a math synthetic benchmark, and AIBench is still early-beta AI workloads for Windows, so quite far away from real world use cases.

Most of the rest of the benchmarks are between a +5% to +35% gain, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system affecting both threads getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, etc – each benchmark is likely different.

The outliers are 3DPM, 3DPMavx, and Corona. All three gain 45% or more, with 3DPM at roughly +66% and 3DPMavx at +77.5%. These tests are very light on cache and memory requirements, and put the increased Zen3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to a greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation – there is less waiting to pull data from the caches, and less contention, which adds to some extra performance.
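
The grouping described over the last few paragraphs can be sketched directly from the multi-threaded table. The bucket thresholds below are hypothetical cut-offs picked to mirror the groups in the text, not values used in our methodology.

```python
from statistics import mean

# SMT On results from the multi-threaded table, as % of the SMT Off baseline.
MT_RESULTS = {
    "Agisoft Photoscan": 98.2, "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5, "y-Cruncher": 94.5, "NAMD AVX2": 106.6,
    "AIBench": 88.2, "Blender": 125.1, "Corona": 145.5, "POV-Ray": 115.4,
    "V-Ray": 126.0, "CineBench R20": 118.6, "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9, "AES Crypto": 104.9, "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}


def bucket(percent: float) -> str:
    """Classify a result; the thresholds are illustrative assumptions."""
    if percent < 97.0:
        return "regression"
    if percent < 103.0:
        return "flat"
    if percent < 140.0:
        return "typical gain"
    return "outlier"


groups = {}
for name, percent in MT_RESULTS.items():
    groups.setdefault(bucket(percent), []).append(name)

for label, names in groups.items():
    print(f"{label}: {', '.join(names)}")

listed_mean = mean(list(MT_RESULTS.values()))
print(f"Arithmetic mean of the listed tests: {listed_mean:.1f}%")
```

Note that the mean printed here covers only the sixteen results listed in the table. An average taken over the wider suite would include tests that are not shown above.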

Overall

In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single threaded benchmarks showed no uplift with SMT enabled/disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.

Most real-world workloads see a modest uplift, with an average of +22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.

For our gaming tests, we are using our AMD Ryzen 9 5950X paired with an NVIDIA RTX 2080 Ti graphics card. Our standard test suite consists of 12 titles, tested at four configurations:

  • Stage 1: Actual Gaming (1080p Maximum Quality, or equivalent)
  • Stage 2: All About Pixels (‘4K Minimum’ Quality)
  • Stage 3: Medium Low (‘1440p Minimum’)
  • Stage 4: Lowest Lows (720p Minimum or lower)

The final three settings are a set of CPU-limited gaming configurations, and help find the limit of where we move from CPU limited to GPU limited. Some users baulk at this testing, finding it irrelevant, however these configurations have been widely requested over the years. The counterpoint to this testing is the first setting, at 1080p Maximum: this was requested given that 1080p is the most popular gaming resolution, and Maximum Quality because this graphics card should be able to handle almost everything at that resolution at very playable framerates.

All the details for our gaming tests can be found in our #CPUOverload article.

Stage 1: Actual Gaming
AMD Ryzen 9 5950X, SMT On vs SMT Off
AnandTech | Settings | Average FPS | 95th Percentile
Chernobylite | 1080p Max | 100% | -
Civilization 6 | 1080p Max | 103% | -
Deus Ex: MD | 1080p Max | 99% | 100%
Final Fantasy 14 | 1080p Max | 102% | -
Final Fantasy 15 | 8K Standard | 100% | 99%
World of Tanks | 1080p Max | 100% | 102%
World of Tanks | 4K Max | 103% | 102%
Borderlands 3 | 1080p Max | 101% | 103%
F1 2019 | 1080p Ultra | 103% | 106%
Far Cry 5 | 1080p Ultra | 104% | 104%
GTA V | 1080p Max | 99% | 100%
RDR 2 | 1080p Max | 100% | 100%
Strange Brigade | 1080p Ultra | 101% | 101%

In real-world gaming situations, there’s very little to pick between having SMT enabled or disabled. Almost universally it is either neutral or a smidgen better to have it enabled, with F1 2019, Civilization 6, and Far Cry 5 seemingly the best recipients. I’ve also added in the Stage 3 result from World of Tanks, just because that benchmark doesn’t really have a proper settings menu.

Stage 2: All About Pixels
AMD Ryzen 9 5950X, SMT On vs SMT Off
AnandTech | Settings | Average FPS | 95th Percentile
Chernobylite | 4K Low | 99% | -
Civilization 6 | 4K Min | 105% | -
Deus Ex: MD | 4K Min | 98% | 100%
Final Fantasy 14 | 4K Min | 102% | -
Final Fantasy 15 | 4K Standard | 100% | 100%
Borderlands 3 | 4K Very Low | 101% | 104%
F1 2019 | 4K Ultra Low | 100% | 100%
Far Cry 5 | 4K Low | 101% | 100%
GTA V | 4K Low | 100% | 101%
RDR 2 | 8K Min | 100% | 100%
Strange Brigade | 4K Low | 100% | 100%

With our high resolution settings with minimal quality, there is only one outlier in Civilization 6 on the average frame rates, which seem to be a bit higher when SMT is enabled.

Stage 3: Medium Low
AMD Ryzen 9 5950X, SMT On vs SMT Off
AnandTech | Settings | Average FPS | 95th Percentile
Chernobylite | 1440p Low | 100% | -
Civilization 6 | 1440p Min | 105% | -
Deus Ex: MD | 1440p Min | 97% | 96%
Final Fantasy 14 | 1440p Min | 102% | -
Final Fantasy 15 | 1080p Standard | 101% | 105%
World of Tanks | 1080p Standard | 101% | 101%
Borderlands 3 | 1440p Very Low | 103% | 105%
F1 2019 | 1440p Ultra Low | 99% | 99%
Far Cry 5 | 1440p Low | 99% | 99%
GTA V | 1440p Low | 100% | 99%
RDR 2 | 1440p Low | 100% | 100%
Strange Brigade | 1440p Low | 100% | 100%

At the more medium settings, we’re starting to see some more variation (Borderlands gets a few percent from SMT). We’re starting to see Deus Ex:MD drop off a bit with SMT enabled.

Stage 4: Lowest Lows
AMD Ryzen 9 5950X, SMT On vs SMT Off
AnandTech | Settings | Average FPS | 95th Percentile
Chernobylite | 360p Low | 106% | -
Civilization 6 | 480p Min | 102% | -
Deus Ex: MD | 600p Min | 91% | 91%
Final Fantasy 14 | 768p Min | 102% | -
Final Fantasy 15 | 720p Standard | 99% | 102%
World of Tanks | 768p Min | 101% | 100%
Borderlands 3 | 360p Very Low | 108% | 110%
F1 2019 | 768p Ultra Low | 102% | 105%
Far Cry 5 | 720p Low | 100% | 101%
GTA V | 720p Low | 99% | 98%
RDR 2 | 384p Low | 100% | 103%
Strange Brigade | 720p Low | 95% | 95%

This is perhaps our most varied set of results, with Deus Ex:MD showing an almost 10% drop with SMT enabled. DEMD is usually considered a CPU title, but so is Chernobylite, which sees a 6% gain. Borderlands 3, which is more of a modern game, is +8-10% with SMT enabled. However, I doubt anyone is playing at these resolutions.

Overall Gaming Performance

If we take full averages from all the data points, then we’re seeing a rough +1% gain in performance in the more complex scenarios across the board.

Resolution Average Comparison
AMD Ryzen 9 5950X, SMT On vs SMT Off
AnandTech | Setting | aka | Average FPS | 95th Percentile
Stage 1 | 1080p Max | Actual Gaming | 101% | 101%
Stage 2 | 4K+ Min | All About Pixels | 101% | 101%
Stage 3 | 1440p Min | Medium Lows | 101% | 101%
Stage 4 | < 768p Min | Lowest Lows | 100% | 101%
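
As a sanity check on how a summary row relates to the per-title data, here is a short sketch using the Stage 4 figures. It assumes the summary rows are simple arithmetic means of the per-title percentages, which the tables do not state explicitly, and it skips titles with no 95th percentile figure.

```python
# Stage 4 ("Lowest Lows") results, SMT On as % of SMT Off:
# (average FPS, 95th percentile). None marks a missing 95th percentile value.
STAGE4 = {
    "Chernobylite": (106, None),
    "Civilization 6": (102, None),
    "Deus Ex: MD": (91, 91),
    "Final Fantasy 14": (102, None),
    "Final Fantasy 15": (99, 102),
    "World of Tanks": (101, 100),
    "Borderlands 3": (108, 110),
    "F1 2019": (102, 105),
    "Far Cry 5": (100, 101),
    "GTA V": (99, 98),
    "RDR 2": (100, 103),
    "Strange Brigade": (95, 95),
}


def column_mean(rows: dict, index: int) -> float:
    """Arithmetic mean of one column, ignoring missing entries."""
    values = [row[index] for row in rows.values() if row[index] is not None]
    return sum(values) / len(values)


avg_fps = column_mean(STAGE4, 0)
p95 = column_mean(STAGE4, 1)
print(f"Stage 4 average FPS: {avg_fps:.1f}%, 95th percentile: {p95:.1f}%")
```

Rounded to whole percentages, both values line up with the Stage 4 row of the summary table, even though individual titles swing by close to 10% in either direction.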

In reality, any loss or gain is highly dependent on the title in question, and can swing from one side of the line to the other. It’s clear that Deus Ex prefers SMT off, and F1 2019 and Borderlands prefer SMT on, but we are talking fine margins here.

Two other arguments for having SMT enabled or disabled come down to power consumption and temperature.

With SMT enabled, the core utilization is expected to be higher, with more instructions flowing through and being processed per cycle. This naturally increases the power requirements on the core, but might also reduce the frequency of the core. The trade-off is meant to be that the work going through the core should be more than enough to make up for extra power used, or any lower frequency. The lower frequency should enable a more efficient throughput, assuming the voltage is adjusted accordingly.
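
The voltage caveat matters because switching power scales roughly with capacitance times voltage squared times frequency. The sketch below uses that standard first-order relation. It ignores static power and any change in switching activity (which SMT itself raises), and the 5% voltage reduction is a purely hypothetical figure for illustration. The clocks are the approximate 3DPMavx frequencies reported further down.

```python
def relative_dynamic_power(freq_scale: float, voltage_scale: float) -> float:
    """Dynamic power relative to a baseline, from P ~ C * V^2 * f with C held fixed."""
    return voltage_scale ** 2 * freq_scale


# Approximate 3DPMavx clocks: 4000 MHz (SMT On) vs 4350 MHz (SMT Off).
freq_scale = 4000 / 4350

# Hypothetical cases: voltage unchanged, and voltage 5% lower at the lower clock.
same_voltage = relative_dynamic_power(freq_scale, 1.00)
lower_voltage = relative_dynamic_power(freq_scale, 0.95)

print(f"Same voltage:     {same_voltage:.2f}x baseline power per unit of activity")
print(f"5% lower voltage: {lower_voltage:.2f}x baseline power per unit of activity")
```

Because voltage enters as a square, even a small reduction alongside the lower clock roughly doubles the power saving in this simplified model.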

This is perhaps where AMD and Intel differ slightly. Intel’s turbo frequency range is hard-bound to specific frequency values based on core loading, regardless of how many threads are active or how many threads per core are active. The activity is a little more opportunistic when we reach steady state power, although exactly how far down the line that is will depend on what the motherboard has set the power length to. AMD’s frequency is continually opportunistic from the moment load is applied: it obviously scales down as more cores are loaded, but it will balance up and down based on core load at all times. On the side of thermals, this will depend on the heat density being generated in each core, but this also acts as a feedback loop into the turbo algorithm if the power limit has not been reached.

For our analysis here, we’ve picked two benchmarks: Agisoft, a variable threaded test that performs practically the same with SMT On/Off, and 3DPMavx, a pure MT test which gets the biggest gain from SMT.

Agisoft

Photoscan from Agisoft is a 2D image to 3D model creator, using dozens of high-quality 2D images to generate related point maps to form a 3D model, before finally texturing the model using the images provided. It is used in archiving artefacts, as well as converting 2D sculpture into 3D scenes. Our test analyses a standardized set of 85 x 18 megapixel photos, with a result measured in time to complete.

Simply looking at CPU temperatures while running our real-world Agisoft test, our current setup (MSI X570 Godlike with Noctua NH-U12S) shows that both modes flutter around 74ºC sustained. Perhaps the interesting element is at the beginning of the test, where the CPU temperatures are higher in SMT Off mode. Looking into the data, during SMT Off the processor is at 4300 MHz, compared to 4150 MHz when SMT is enabled. This would account for the difference.

Looking at power, we can follow that for the bulk of the test, both modes have similar package power consumption, around 130 W. The SMT Off mode is drawing more power during the first couple of minutes of the test, due to the higher frequency. Clearly the thermal density in this part of the test by only having one thread per core is allowing for a higher turbo.

If we measure the total power of the test, it’s basically identical in any metric that matters. Nearer the end of the test, where the workload is more variably threaded, the SMT Off mode seems to come in slightly lower on power. The benchmark completion time is essentially the same due to the nature of the test, but SMT Off comes in at 2% lower power overall.

3DPMavx (3D Particle Movement)

Our 3DPM test is an algorithmic sequence of non-interactive random three-dimensional movement, designed to simulate molecular diffusive movement inside a gas or a fluid. The simulation is made non-interactive (i.e. no two molecules will collide) due to the original average movement of each particle taking collisions into account. Our test cycles through six movement algorithms at ten seconds apiece, followed by ten seconds of idle, with the whole loop being repeated six times, taking about 20 minutes, regardless of how fast or slow the processor is. The related performance figure is millions of particle movements per second. Each algorithm has been accelerated for AVX2.
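
To give a feel for why this kind of workload spreads so easily across threads, here is a minimal sketch of non-interactive random particle movement. It is not the benchmark's actual code or any of its six algorithms. The unit step length, the direction sampling method, and all names are assumptions made for illustration.

```python
import math
import random


def random_unit_step(rng: random.Random) -> tuple:
    """Return a unit-length step in a uniformly random 3D direction."""
    z = rng.uniform(-1.0, 1.0)              # uniform height on the sphere
    phi = rng.uniform(0.0, 2.0 * math.pi)   # uniform angle around the z axis
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)


def move_particles(count: int, steps: int, seed: int = 0) -> list:
    """Move independent particles; no particle ever reads another's state."""
    rng = random.Random(seed)
    positions = [[0.0, 0.0, 0.0] for _ in range(count)]
    for pos in positions:
        for _ in range(steps):
            dx, dy, dz = random_unit_step(rng)
            pos[0] += dx
            pos[1] += dy
            pos[2] += dz
    return positions


positions = move_particles(count=1000, steps=100)
print(f"{1000 * 100} particle movements computed")
```

Since each particle only touches its own position, the outer loop can be split across as many threads as the hardware offers with almost no shared data, which fits the light cache and memory footprint described earlier.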

On the temperature side of things, it is clear that the SMT Off mode again puts up a higher thermal profile. Temperatures this time peak at 66ºC, but the difference between the two modes is clear.

On the power side, we can see why SMT Off mode is warmer – the cores are drawing more power. Looking at the data, SMT Off mode is running ~4350 MHz, compared to SMT On which is running closer to 4000 MHz.

With the higher frequency with SMT Off, the estimated total power consumption is 6.8% higher. This appears to be very constant throughout the benchmark, which lasts about 20 minutes total.

But, let us add in the performance numbers. Because 3DPMavx can take advantage of SMT On, that mode scores +77.5% by having two threads per core rather than one (a score of 10245 vs 5773). Combined this makes SMT On mode +91% better in performance per watt on this benchmark.
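
The performance per watt comparison can be reproduced approximately from the rounded figures in the text. The sketch below assumes the 6.8% figure means SMT Off draws 1.068 times the power of SMT On over the same fixed-length run. The function name is hypothetical.

```python
def perf_per_watt_gain(score_on: float, score_off: float, off_power_vs_on: float) -> float:
    """Relative perf/W advantage of SMT On over SMT Off.

    off_power_vs_on is SMT Off power divided by SMT On power (1.068 = 6.8% higher).
    """
    perf_ratio = score_on / score_off
    # perf/W ratio = (score_on / power_on) / (score_off / power_off)
    return perf_ratio * off_power_vs_on - 1.0


gain = perf_per_watt_gain(10245, 5773, 1.068)
print(f"SMT On perf/W advantage: {gain:+.1%}")
```

With these rounded inputs the result lands at roughly +90%, in line with the figure quoted above. The small remaining gap is presumably down to rounding in the published power number, which is an assumption on our part.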

I wasn’t too sure what we were going to see when I started this testing. I know the theory behind implementing SMT, and what it means for the instruction streams having access to core resources, and how cores that have SMT in mind from the start are built differently to cores that are just one thread per core. But theory only gets you so far. Aside from all the forum messages over the years talking about performance gains/losses when a product has SMT enabled, and the few demonstrations of server processors running focused workloads with SMT disabled, it is actually worth testing on real workloads to find if there is a difference at all.

Results Overview

In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.

In single threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance when SMT is enabled – all of our workloads were within roughly 1% either side.

In multi-threaded workloads, we saw an average uplift in performance of +22% when SMT was enabled. Most of our tests scored a +5% to a +35% gain in performance. A couple of workloads scored worse, mostly due to resource contention having so many threads in play – the limit here is memory bandwidth per thread. One workload scored +66%, a computational workload with little-to-no memory requirements; this workload scored even better in AVX2 mode, showing that there is still some bottleneck that gets alleviated with fewer instructions.

On gaming, overall there was no difference between SMT On and SMT Off, however some games may show differences in CPU limited scenarios. Deus Ex was down almost 10% when CPU limited, however Borderlands 3 was up almost 10%. As we moved to a more GPU limited scenario, those discrepancies were neutralized, with a few games still gaining single-digit percentage points improvement with SMT enabled.

For power and performance, we tested two examples where performance at two threads per core either saw no improvement (Agisoft), or significant improvement (3DPMavx). In both cases, SMT Off mode (1 thread/core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core has a big performance increase, the power in that mode was also lower, and there was a significant +91% performance per watt improvement by enabling SMT.

What Does This Mean?

I mentioned at the beginning of the article that SMT performance gains can be seen from two different viewpoints.

The first is that if SMT enables more performance, then it’s an easy switch to use, and some users consider that if you can get perfect scaling, then SMT is an effective design.

The second is that if SMT enables too much performance, then it’s indicative of a bad core design. If you can get perfect scaling with SMT2, then perhaps something is wrong about the design of the core and the bottleneck is quite bad.

Having poor SMT scaling doesn’t always mean that the SMT is badly implemented – it can also imply that the core design is very good. If an effective SMT design can be interpreted as a poor core design, then it’s quite easy to see that vendors can’t have it both ways. Every core design has deficiencies (that much is true), and both Intel and AMD will tell their users that SMT enables the system to pick up extra bits of performance where workloads can take advantage of it, and for real-world use cases, there are very few downsides.

We’ve known for many years that having two threads per core is not the same as having two cores – in a worst case scenario, there is some performance regression as more threads try and fight for cache space, but those use cases seem to be highly specialized for HPC and Supercomputer-like tasks. SMT in the real world fills in the gaps where gaps are available, and this occurs mostly in heavily multi-threaded applications with no cache contention. In the best case, SMT offers a sizeable performance per watt increase. But on average, there are small (+22% on MT) gains to be had, and gaming performance isn’t disturbed, so it is worth keeping enabled on Zen 3.