The Armari Magnetar X64T Workstation OC Review: 128 Threads at 4.0 GHz, Sustained!

Blitzing around a race track in a fast car only ever convinces you of one thing: I need to go around the track even faster. I need a better car, I need a better engine, better brakes, or better tires. I need that special go faster juice, and I need to nail the perfect run. The world of professional computing works the same way: whether it comes down to rendering, rapid prototyping, scientific compute, medical imaging, weather modelling, or something like oil and gas simulation, the more raw horsepower there is, the more can be done. So enter the new Armari Magnetar X64T – an overclocked 64-core Threadripper 3990X that holds the new SPECworkstation3 world record. We got hold of one. It’s really fast.

Playing with Performance

AMD’s Threadripper 3990X is one of those crazy processors. It comes at you with some of the best of any processor statistics: it has 64 cores and 128 threads, it has 256 MB of L3 cache, it has a TDP of 280 W, which allows for a 2.9 GHz base frequency up to a 4.3 GHz turbo. It is overclockable, and so with the right system those frequencies can go even higher. With the best binned 7nm chiplets, paired with quad-channel DDR4-3200 memory, for multithreaded workloads it is one of the ultimate powerhouses anyone can build in a single socket with a socketable processor.

In our initial review of the Threadripper 3990X, it blitzed any software that could take advantage of all those threads – the nearest competitors were the 32-core Threadrippers, or Intel’s 28-core Xeon-W processors. We even put it up against two of Intel’s $10000 28-core Xeons, and it won pretty much everything by a large margin.

So what happens when we overclock it? There are those that want more, and not just those overclocking for fun – workstation customers, like animation studios, are always looking for ways in which they can rapidly render frames for upcoming projects. If a cooling system can be built to withstand it, and the power is available, then there’s always scope to get more out of the hardware that comes from the big players. This is what the Armari Magnetar X64T Workstation is designed to do – get more.

To that end, today AMD and SPEC are announcing that the Magnetar X64T workstation, a system that you can buy, will give the best off-the-shelf performance in SPECworkstation3 ever seen.

The key highlight from this review, should you not read any further, is that this system is built to blitz workloads. The Threadripper 3990X is usually fast enough in its own right, but Armari have gone above and beyond. The goal of this system is to be an off-the-shelf powerhouse that requires very little setup from its customers.

Armari, perhaps a lesser-known system integrator, is a company that has in recent years focused on building systems for 3D animation, video editing, and scientific research. With over 20 years of experience, Armari’s hardware has gone into high performance computing solutions and clusters that have featured in the TOP500 lists, as well as rendering server farms for the top animation, VFX, and CGI studios in Soho, London.

These are clients who want the best performance, and Armari positions itself not so much as a boutique system builder, but something between the big OEMs (like Dell/HP) and the main retailers to offer custom solutions by leveraging its network of cooling and hardware contacts around the world. This enables the company to build custom chassis, obtain optimized memory, order power supplies with custom connector configurations, and ensure consistency from batch-to-batch when ordering from its partners. In speaking to Armari’s Technical Director Dan Goldsmith, he mentioned that working with partner companies for so long has enabled them to get access to rapid prototyping and component consistency with continual feedback with partners such as EKWB, ASRock, Tyan, and many other ODM companies that Armari leverages on a regular basis. 

The Magnetar X64T, I was told, leverages the strong relationship Armari has with AMD. The Opteron was a popular range a decade ago, and that partnership has been maintained through today. The goal of the Magnetar project was to create a system that offers the best that Threadripper has to offer while still enabling the under-the-desk workstation platform. This project has been slightly delayed due to COVID, and AMD now has Threadripper Pro, but those processors are not overclockable – for those that want raw performance, AMD and Armari believe they are onto a winner.

The key to the system is in how Armari is cooling the processor, and the choice of components. The Magnetar X64T features a custom water cooling loop, which is perhaps not anything new in its own right, however the company has created a component chain to ensure consistency in its design, as well as using some of the most powerful options available.

The water block is probably the best place to start, because this is a completely custom-for-Armari design built in partnership with EK Water Blocks. This block is specifically built for this one motherboard, the ASRock TRX40 Taichi, and applies cooling to both the processor and the power delivery. The block works in conjunction with the highest-quality thermal paste pads on the market, to ensure a flat connection with the water block. As it also covers the power delivery, Armari worked with ASRock to enable a consistent z-height of all the power delivery components, something that can vary during manufacturing, and maintain that consistency on a batch-by-batch basis. Pair this up with Armari’s custom FWL liquid cooling pump, reservoir, tubing, 3x140mm radiator, and fan combinations (many of which are custom from their respective ODMs), and we have a cooling capacity in excess of 700 W. The coolant is a special long-life coolant designed for 24/7 operation over three years, and the standard warranty comes with service during those three years, including collection and return, at no extra cost.

Now, the ASRock TRX40 Taichi isn’t the top Threadripper motherboard on the market, and Armari fully admits that, however it points out that the best motherboard available costs twice as much. In working with ASRock, they were able to co-ordinate what was needed within the discrete motherboard component lists as well as enable a custom BIOS implementation for additional control. One of the tradeoffs I was told about is that a cheaper motherboard might mean slightly cheaper components, however Armari says that their cooling system and setup were co-operatively tuned to meet its customers’ demands.

With this cooling arrangement, Armari have fired up the overclock. In our initial review of the Threadripper 3990X, we were observing ~3450 MHz during our sustained running with the CPU reaching its full 280 W. For the Armari Magnetar X64T, we have an all-core frequency from 3950-4100 MHz, depending on the workload. Users might scoff at the +400-550 MHz lift, but bear in mind that this is across all 64 cores simultaneously, and the cooling is built such that this frequency is sustained for renders or simulations that might take days. Further details of frequency and power come later in the review.

While having the overclocked CPU is great, the Magnetar X64T system we were delivered also had memory, graphics, and storage.

Armari Magnetar X64T as shipped (X64T-RD1600G3-FWL)
Processor | AMD Ryzen Threadripper 3990X; Overclocked to ~4.0 GHz All-Core Turbo
Cooling | Custom Armari FWLv2 Liquid Cooling Loop; Custom CPU+VRM Monoblock; 420x45mm EK Coolsense Radiator; 3 x EK-Vardar 140ER EVO 140mm fans; High Performance Pump; Clear Coolant, Designed for 3yr operation
Graphics | PNY NVIDIA Quadro RTX 6000 24 GB
Motherboard | ASRock TRX40 Taichi
Memory | 256 GB of DDR4-3200
Power Supply | 1600W 80PLUS Gold 93%, rated to 50ºC; 0% fan under 40% load; 9x PCIe connections
Storage | ASRock Hyper Quad M.2 PCIe 4.0 x16 add-in card; 1 x Corsair MP600 PCIe 4.0 x4 1 TB Boot Drive; 2 x Corsair MP600 PCIe 4.0 x4 1TB Striped Array
Networking | Realtek RTL8125 2.5 GbE (motherboard); Intel I211-AT 1 GbE (motherboard); Intel AX201 Wi-Fi 6 module (motherboard)
Audio | Onboard Realtek ALC1220 + ALC4050H
Fans | 3 x EK 140mm for radiator; 2 x Noctua 140mm for internal airflow; 1 x SanAce 80mm low noise for DRAM
Price as Built | £10790 + tax (~$14200 + tax)
UK Warranty | 1 Year RTB; 3 Year Parts+Labor; One service/coolant replacement, inc collection/pickup; Loaner systems available if bigger issues occur

The system as shipped came with a PNY NVIDIA Quadro RTX 6000 graphics card, which is essentially an RTX 2080 Ti on steroids with 24 GB of GDDR6, and the system can be configured with two of them. As Threadripper is not an ECC-qualified platform, the X64T comes with the peak configuration supported, 256 GB, but with custom SPD profiles to run up to DDR4-3600. Unfortunately, due to how quickly this system was rebuilt for this review, the system I was sent was using DDR4-3200 at CL20, as some of the original memory was accidentally splashed with coolant, and Armari wanted to ensure I wouldn’t have any issues with the system.

Storage comes in two forms, both of which are PCIe 4.0. As shipped, our system was specified with a Corsair MP600 1 TB PCIe 4.0 x4 boot drive. Another two of these drives were provided inside an ASRock Hyper M.2 PCIe 4.0 card, plugged into one of the PCIe 4.0 slots. Armari says that as newer and bigger PCIe 4.0 drives come to market beyond the Phison E16 solutions, this should expand to higher capacity drives or faster drives as required.

The power supply is a fully custom 1600W 80PLUS Gold unit, rated to run at 50 ºC with 93% efficiency. It has a custom fan profile directly from the OEM, and is set to only stir up the fans if the power required goes above 40% (640 W). The fully modular PSU has nine 8-pin connections and five 6-pin connections, providing 14 total, should any customer want to go above and beyond. The PSU on its own has a 10-year warranty.

The motherboard has a 2.5 GbE wired network port and a 1 GbE wired network port, and Armari does offer a 10G upgrade (space permitting based on the PCIe slots). Wi-Fi 6 support comes as standard, as does the ALC1220 audio configuration.

The chassis is the last custom part to discuss, with the system featuring the Magnetar naming on the front with the Armari logo. The chassis is big, but quite standard for a high-end workstation platform: 53cm x 22cm x 57cm (20.9-in x 8.7-in x 22.4-in), with a typical single GPU weight of 18 kg (39.7 lbs).

The chassis comes with handles on top that fold away, making the system easy to move around as required. I love these.

Inside there is lots of ‘space’ for directed airflow. The pump and reservoir are found in the bottom of the case, underneath the standard drive-bays, while the 3x140mm double thick radiator is at the top built into the side of the chassis. This is a special hinged mount, which makes the side panel easy to remove and the cooling apparatus easy to inspect.

There is a PCIe retention bracket for any add-in card installed, and in the base of the chassis is the power supply, hidden away. The insides weren’t necessarily built to look aesthetically pleasing, however the system as provided by Armari has a nice clean look.

Due to a couple of issues with arranging this system for review, I was told that normally Armari adds in some custom sealant to help with the liquid cooling loop, however as it requires 24 hours to set, they weren’t able to in this instance. The liquid cooling loop is pre-tested for every system they build at over 1 bar of pressure, along with full stability testing and thermal testing before shipping. If for any reason a system needs to be returned under warranty, Armari can supply a loaner system if required. As mentioned above, the standard warranty includes one full service and inspection, and the coolant can be replaced in order to give the customer another 3 years of ‘hassle free’ operation.

Today AMD and Armari are announcing that the new Magnetar X64T has set a new world record in the SPECworkstation 3 benchmark. The system that achieved this result is, by and large, the system I am testing today (it was stripped down and rebuilt with an updated water block). For the customers that Armari typically services, this is one of the primary benchmarks they care about, and so getting a new world record for a commercially available system should put Armari’s offerings high on their list.

Our testing, as shown over the next few pages, is similarly impressive. We already saw that the Threadripper 3990X with no overclock was a formidable beast in almost every one of our rendering and compute workloads. The only real comparison point we have is the W-3175X workstation that was provided when we reviewed that processor.

The Magnetar X64T-RD1600G3 FWL (the full name) system in our testing is ~£10790 ($14200) excluding tax. This includes a Windows 10 Professional 64-bit license, and Armari’s 3 year premium workstation warranty, with 1-year on site and 2nd/3rd-year parts and labor, along with a loaner system for the duration of any repairs.

Read over the next few pages for our testing on Performance and Power.

The best place to start for performance is to confirm that this system does get the best SPECworkstation 3 score ever. For users who have never heard of SPECworkstation, it comes from the same people that have the SPEC benchmark that we often use on new processors. The workstation element comes in because this set of benchmarks is designed to test a number of common workstation workloads, such as 3D rendering and animation, molecular modeling and dynamics, medical, oil and gas, construction and architecture, financial services, general operations, and GPU compute. This benchmark combines 30 workloads and ~140 tests into a single package, and results are given as a multiple of the performance of a ‘reference’ machine using an Intel quad-core Skylake processor running a W3100 AMD GPU. This means that this quad-core Intel system gets a value of ‘1’.

SPECworkstation 3 Test Systems
System | CPU | GPU | DRAM | SSD | Price
Fujitsu Celsius R970 | 2 x Xeon 8276 | RTX 8000 | DDR4-2933 | PCIe 3.0 | $30000+
Armari Magnetar X64T | TR3 3990X | RTX 6000 | DDR4-3200 | PCIe 4.0 | ~$14200
TR3 3990X ‘Stock’ | TR3 3990X | 2080 Super | DDR4-3200 | SATA |
W-3175X ‘Stock’ | Xeon W-3175X | 2080 Ti | DDR4-2933 | SATA |

The current system at the top of the official SPECworkstation 3 standings is a Fujitsu Celsius R970 workstation (D3488-A2). This is the system that Armari has beaten with the X64T. The Fujitsu uses two Intel Xeon Platinum 8276 processors (28 cores each, 56 cores in total) paired with an NVIDIA Quadro RTX 8000 and 384 GB of DDR4-2933. This system, going on list prices for just these components, already comes to $24538. Add in the rest, and some overhead, and this is easily $30000+. By comparison, Armari’s Magnetar X64T workstation is only ~$14200.

The results are as follows. Here we are comparing the Fujitsu official results to Armari’s official results. We also have included our results with the same system (technically classified as ‘estimated results’ because these haven’t been formally submitted to the results database), and a W-3175X system with an RTX 2080 Ti and PCIe 3.0 SSD.

SPECworkstation 3 Results
Category | Fujitsu Celsius R970* | Armari Magnetar X64T* | Our X64T Run | 3990X stock | 3175X 2080 Ti
Media and Entertainment | 4.72 | 7.04 | 6.84 | 4.79 |
Product Development | 6.07 | 10.85 | 9.95 | 3.51 |
Life Sciences | 5.89 | 8.24 | 8.11 | |
Financial Services | 8.78 | 10.55 | 10.45 | 9.15 |
Energy | 5.44 | 9.09 | 8.73 | 4.20 |
General Operations | 2.27 | 2.53 | 2.45 | 1.55 |
GPU Compute | 5.40 | 5.75 | 5.70 | 4.63 |
Geomean | 5.17 | 7.06 | 6.84 | 4.08 | **

*As submitted to SPEC
**awaiting results

Within each of these segments, 7-20 sub-tests are performed covering CPU, GPU, and Storage workloads. Our results were a little lower than Armari’s, however that can be down to tuning, ambient temperatures, and repeated runs. Our run was within 3%.

Overall, the Magnetar X64T results beat the old Fujitsu results by 37%:

  • CPU: Armari wins by +46%
  • GPU: Armari wins by +12%
  • Storage: Armari wins by +58%
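
As a quick sanity check on the headline figure, the overall geomean and the lead can be recomputed from the seven category scores in the results table. This is a minimal sketch that assumes the published geomean is a plain, unweighted geometric mean of the category scores; the helper name `geomean` is ours and not part of SPECworkstation.

```python
import math

# Category scores copied from the SPECworkstation 3 results table
# (the two columns marked as submitted to SPEC).
fujitsu = [4.72, 6.07, 5.89, 8.78, 5.44, 2.27, 5.40]
armari = [7.04, 10.85, 8.24, 10.55, 9.09, 2.53, 5.75]

def geomean(scores):
    """Unweighted geometric mean, computed in log space."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

fujitsu_gm = geomean(fujitsu)
armari_gm = geomean(armari)

# Relative lead of the Armari system over the Fujitsu system, in percent.
lead_pct = (armari_gm / fujitsu_gm - 1) * 100

print(f"Fujitsu geomean: {fujitsu_gm:.2f}")
print(f"Armari geomean:  {armari_gm:.2f}")
print(f"Armari lead:     {lead_pct:.0f}%")
```

Under that assumption the sketch lands on the same 5.17 and 7.06 geomeans listed in the table, and on the 37% overall lead.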

Now, users might wonder how the Armari wins in the GPU tests, given that it has an RTX 6000 compared to the RTX 8000 in the Fujitsu. This is mainly down to processor performance – the Fujitsu system processors have a base frequency of 2200 MHz, compared to the Magnetar X64T which can run all cores at 3925 MHz. Even if the Fujitsu was using the CPU in single core mode, and hitting its max turbo of 4000 MHz, the Armari would be using the better IPC of the Zen 2 core against Intel’s Skylake core.

Now each of the above scores is a combined score from several sub-tests.

The Intel-based Fujitsu system does have some specific wins in individual tests, such as Maya Storage (+15%), NAMD Storage (+12%) and 7-zip CPU (+75%), however these mostly apply due to the increased memory capacity of the Intel machine.

The AMD-based Armari system has 40 other wins, including Blender CPU (+62%), HandBrake CPU (+86%), CFD CPU (+108%), NAMD CPU (+164%), Seismic Data Processing (+230%), LAMMPS storage (+88%), and Creo GPU (+55%).

Full data for the Armari and the Fujitsu systems can be found at these links:

On the AnandTech benchmark side of the equation, if you read the recent article on our #CPUOverload project, we detail the 150 or so tests in our new testing suite that we aim to perform on as many CPUs as possible. These tests are designed for a wide range of systems, from highly responsive systems for user access, low powered devices, gaming machines, workstations, and the enterprise market, with a variation to cover a wide range of markets. All of our results will be published in our benchmark database, Bench, and the key ones that form the focus of the Magnetar are compared on this page.

The Magnetar X64T is a workstation through and through, with a focus on rendering, simulation, and hard core math. We’ve currently tested our new benchmark suite on around 20 processors, and out of these we have the following comparison points:

AnandTech Test Systems
System | Cores | DRAM
Armari Magnetar X64T | 64C / 128T | 4 Ch DDR4-3200 C20
TR3 3990X | 64C / 128T | 4 Ch DDR4-3200 C15
Xeon W-3175X | 28C / 56T | 6 Ch DDR4-2666 C20
Xeon Gold 6258R | 28C / 56T | 6 Ch DDR4-2933 C21

The key win here for the Magnetar X64T is going to be multithreaded performance, where it hits 3.9-4.1 GHz all-core sustained depending on the test.

The key things to note here are between the Magnetar X64T and our stock 3990X. This system typically ships with DDR4-3866 C18, however during a last-minute rebuild before the system was shipped, a coolant accident meant that the memory was replaced for stability. As a result, we’re going to see some circumstances where the faster memory of the stock 3990X will win out: in our peak bandwidth test, the X64T scored 81 GB/s, and our 3990X scored 85 GB/s. The four-channel Threadripper also has a bandwidth deficit to the six-channel Xeons, which is noticeable in a couple of tests. However, in the tests where the Magnetar wins, it’s usually by a lot, as shown on the previous page.
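
To put the channel-count deficit in numbers, the theoretical peak bandwidth of each memory configuration in the table can be estimated as channels x transfer rate x bytes per transfer. This is a rough sketch only: it assumes a standard 64-bit (8-byte) DDR4 channel and uses the nominal DDR4-xxxx figure as the transfer rate in MT/s; the function name `peak_bandwidth_gbs` is ours.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak in GB/s: channels x transfers per second x bytes per transfer.

    bus_bytes=8 assumes a standard 64-bit DDR4 channel.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9

# Memory configurations as listed in the test systems table.
systems = {
    "Threadripper 3990X, 4 Ch DDR4-3200": (4, 3200),
    "Xeon W-3175X, 6 Ch DDR4-2666": (6, 2666),
    "Xeon Gold 6258R, 6 Ch DDR4-2933": (6, 2933),
}
for name, (channels, rate) in systems.items():
    print(f"{name}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s theoretical peak")

# Fraction of the theoretical peak reached in our peak bandwidth test.
tr_peak = peak_bandwidth_gbs(4, 3200)
x64t_fraction = 81 / tr_peak   # Magnetar X64T, measured 81 GB/s
stock_fraction = 85 / tr_peak  # stock 3990X, measured 85 GB/s
print(f"X64T: {x64t_fraction:.0%} of peak, stock 3990X: {stock_fraction:.0%} of peak")
```

On these assumptions the four-channel Threadripper tops out at about 102 GB/s on paper, against roughly 128 GB/s and 141 GB/s for the two six-channel Xeon configurations.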

Out of these CPUs, nothing else we’ve tested since our new benchmark suite started comes close. I think the key product we’re missing here is a 64-core EPYC or Threadripper Pro, which we’re hoping to receive soon.

Here are our rendering benchmark results.

Blender 2.83 LTS: Link

One of the popular tools for rendering is Blender, with it being a public open source project that anyone in the animation industry can get involved in. This extends to conferences, use in films and VR, with a dedicated Blender Institute, and everything you might expect from a professional software package (except perhaps a professional grade support package). With it being open-source, studios can customize it in as many ways as they need to get the results they require. It ends up being a big optimization target for both Intel and AMD in this regard.

For benchmarking purposes, Blender offers a benchmark suite of tests: six tests varying in complexity and difficulty for any system of CPUs and GPUs, which can take up to several hours of compute time to render, even on GPUs commonly associated with rendering tools. Unfortunately what was pushed to the community wasn’t friendly for automation purposes, with there being no command line, no way to isolate one of the tests, and no way to get the data out in a usable manner.

To that end, we fell back to rendering one frame from a detailed project. Most reviews, as we have done in the past, focus on one of the classic Blender renders, known as BMW_27. It can take anywhere from a few minutes to almost an hour on a regular system. However now that Blender has moved onto a Long Term Support model (LTS) with the latest 2.83 release, we decided to go for something different.

We use this scene, called PartyTug at 6AM by Ian Hubert, which is the official image of Blender 2.83. It is 44.3 MB in size, and uses some of the more modern compute properties of Blender. Although it is more complex than the BMW scene, it uses different aspects of the compute model, so the time to process is roughly similar to before. We loop the scene for 10 minutes, taking the average time of the completed renders. Blender offers a command-line tool for batch commands, and we redirect the output into a text file.
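
The loop-for-a-fixed-duration approach described above can be sketched roughly as follows. This is an illustration and not our actual benchmark script: `loop_for`, `average`, `SimulatedClock`, and `fake_render` are hypothetical names, and the demo uses a simulated clock with a made-up 4-second render so that it finishes instantly. In a real harness the task would launch the renderer from the command line, and the default clock would be used.

```python
import time

def loop_for(duration_s, task, clock=time.perf_counter):
    """Run task() repeatedly until duration_s has elapsed; return per-run times.

    A run that has started is always allowed to finish, so the total
    elapsed time can overshoot duration_s by up to one run.
    """
    run_times = []
    start = clock()
    while clock() - start < duration_s:
        t0 = clock()
        task()
        run_times.append(clock() - t0)
    return run_times

def average(values):
    """Arithmetic mean of the recorded run times."""
    return sum(values) / len(values)

class SimulatedClock:
    """Stand-in clock for the demo; it only advances when told to."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now

sim = SimulatedClock()

def fake_render():
    # Hypothetical workload: pretend each render takes 4 seconds.
    sim.now += 4.0

# Loop for 10 simulated seconds and average the completed runs.
times = loop_for(10.0, fake_render, clock=sim)
print(f"{len(times)} renders completed, mean {average(times):.1f} s")
```

Looping on time rather than on a fixed run count keeps the test length similar on fast and slow systems, at the cost of a different number of samples per system.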

(4-1) Blender 2.83 Custom Render Test

Over the standard Threadripper system, the X64T is around 32% faster in our Blender scene.

Corona 1.3: Link

Corona is billed as a popular high-performance photorealistic rendering engine for 3ds Max, with development for Cinema 4D support as well. In order to promote the software, the developers produced a downloadable benchmark on the 1.3 version of the software, with a ray-traced scene involving a military vehicle and a lot of foliage. The software does multiple passes, calculating the scene, geometry, preconditioning and rendering, with performance measured in the time to finish the benchmark (the official metric used on their website) or in rays per second (the metric we use to offer a more linear scale).

The standard benchmark provided by Corona is interface driven: the scene is calculated and displayed in front of the user, with the ability to upload the result to their online database. We got in contact with the developers, who provided us with a non-interface version that allowed for command-line entry and retrieval of the results very easily.  We loop around the benchmark five times, waiting 60 seconds between each, and taking an overall average. The time to run this benchmark can be around 10 minutes on a Core i9, up to over an hour on a quad-core 2014 AMD processor or dual-core Pentium.

(4-2) Corona 1.3 Benchmark

Corona typically scales very well with core count and frequency, and here the X64T has a 28% lead over a stock 3990X.

V-Ray: Link

We have a couple of renderers and ray tracers in our suite already, however V-Ray’s benchmark was requested often enough for us to roll it into our suite. Built by ChaosGroup, V-Ray is a 3D rendering package compatible with a number of popular commercial imaging applications, such as 3ds Max, Maya, Unreal, Cinema 4D, and Blender.

We run the standard standalone benchmark application, but in an automated fashion to pull out the result in the form of kilosamples/second. We run the test six times and take an average of the valid results.

(4-5) V-Ray Renderer

Similarly, the X64T has a 30% performance gain.

Cinebench R20: Link

Another common staple of a benchmark suite is Cinebench. Based on Cinema4D, Cinebench is a purpose-built benchmark that renders a scene with both single and multi-threaded options. The scene is identical in both cases. The R20 version means that it targets Cinema 4D R20, a slightly older version of the software which is currently on version R21. Cinebench R20 was launched given that the R15 version had been out a long time, and despite the difference between the benchmark and the latest version of the software on which it is based, Cinebench results are often quoted in marketing materials.

Results for Cinebench R20 are not comparable to R15 or older, because both the scene being used and the code path have changed. The results are output as a score from the software, which is inversely proportional to the time taken. Using the benchmark flags for single CPU and multi-CPU workloads, we run the software from the command line which opens the test, runs it, and dumps the result into the console which is redirected to a text file. The test is repeated for 10 minutes for both ST and MT, and then the runs averaged.

(4-6b) CineBench R20 Multi-Thread

Cinebench go brrrr. I will never get tired of a quick R20 run like this, at around 15 seconds for the X64T. Performance is +35% over the stock 3990X.

Beyond rendering workloads, we also have a number of key math-heavy specific workloads that systems like the Magnetar X64T were designed for.

AES Encoding

Algorithms using AES coding have spread far and wide as a ubiquitous tool for encryption. Again, this is another CPU limited test, and modern CPUs have special AES pathways to accelerate their performance. We often see scaling in both frequency and cores with this benchmark. We use the latest version of TrueCrypt and run its benchmark mode over 1GB of in-DRAM data. Results shown are the GB/s average of encryption and decryption.

(5-3) AES Encoding

Our test here has a limit of only using 64 threads, which the X64T takes to full effect applying maximum frequency across all the cores.

Agisoft Photoscan 1.3.3: link

Photoscan stays in our benchmark suite from the previous benchmark scripts, but is updated to the 1.3.3 Pro version. As this benchmark has evolved, features such as Speed Shift or XFR on the latest processors come into play as it has many segments in a variable threaded workload.

The concept of Photoscan is about translating many 2D images into a 3D model – so the more detailed the images, and the more you have, the better the final 3D model in both spatial accuracy and texturing accuracy. The algorithm has four stages, with some parts of the stages being single-threaded and others multi-threaded, along with some cache/memory dependency in there as well. For some of the more variable threaded workload, features such as Speed Shift and XFR will be able to take advantage of CPU stalls or downtime, giving sizeable speedups on newer microarchitectures.

For the update to version 1.3.3, the Agisoft software now supports command line operation. Agisoft provided us with a set of new images for this version of the test, and a python script to run it. We’ve modified the script slightly by changing some quality settings for the sake of the benchmark suite length, as well as adjusting how the final timing data is recorded. The python script dumps the results file in the format of our choosing. For our test we obtain the time for each stage of the benchmark, as well as the overall time.

(1-1) Agisoft Photoscan 1.3, Complex Test

Because Photoscan is a more varied workload, the gains from something like this are more niche.

3D Particle Movement v2.1: Non-AVX and AVX2/AVX512

This is the latest version of the benchmark designed to simulate semi-optimized scientific algorithms taken directly from my doctorate thesis. This involves randomly moving particles in a 3D space using a set of algorithms that define random movement. Version 2.1 improves over 2.0 by passing the main particle structs by reference rather than by value, and decreasing the amount of double->float->double recasts the compiler was adding in.

The initial version of v2.1 is a custom C++ binary of my own code; flags are in place to allow for multiple loops of the code with a custom benchmark length. By default this version runs six times and outputs the average score to the console, which we capture with a redirection operator that writes to file.


An example run on an i7-6950X

For v2.1, we also have a fully optimized AVX2/AVX512 version, which uses intrinsics to get the best performance out of the software. This was done by a former Intel AVX-512 engineer who now works elsewhere. According to Jim Keller, there are only a couple dozen or so people who understand how to extract the best performance out of a CPU, and this guy is one of them. To keep things honest, AMD also has a copy of the code, but has not proposed any changes.

The 3DPM test is set to output millions of movements per second, rather than time to complete a fixed number of movements. This way the data becomes linear as performance scales, and is easier to read as a result.
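
A small sketch shows why a rate is easier to read than a completion time. The workload size and the 10-second baseline below are hypothetical values chosen for illustration, not figures from 3DPM itself.

```python
WORK = 1_000_000_000  # hypothetical fixed workload: one billion movements

def rate_mps(seconds, movements=WORK):
    """Millions of movements per second for a run of the given length."""
    return movements / seconds / 1e6

# A processor that is 2x or 4x faster halves or quarters the time,
# but doubles or quadruples the rate.
for speedup in (1, 2, 4):
    seconds = 10.0 / speedup  # hypothetical baseline of 10 s
    print(f"{speedup}x faster: {seconds:5.2f} s -> {rate_mps(seconds):6.1f} M movements/s")
```

The time column shrinks towards zero as hardware gets faster, while the rate column grows in step with performance, which is what makes it plot as a straight line.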

(2-2) 3D Particle Movement v2.1 (Peak AVX)

Because the Intel processors have AVX-512, they win here. This is one of the fundamental differences that might put people in the direction of an Intel system, should their code be AVX-512 accelerated. For AVX2 code paths, the Magnetar gets another 36% lead over the stock processor.

NAMD 2.13 (ApoA1): Molecular Dynamics

One of the popular science fields is modelling the dynamics of proteins. By looking at how the energy of active sites within a large protein structure changes over time, scientists behind the research can calculate required activation energies for potential interactions. This becomes very important in drug discovery. Molecular dynamics also plays a large role in protein folding, and in understanding what happens when proteins misfold, and what can be done to prevent it. Two of the most popular molecular dynamics packages in use today are NAMD and GROMACS.

NAMD, or Nanoscale Molecular Dynamics, has already been used in extensive Coronavirus research on the Frontera supercomputer. Typical simulations using the package are measured in how many nanoseconds per day can be calculated with the given hardware, and the ApoA1 protein (92,224 atoms) has been the standard model for molecular dynamics simulation.

Luckily the compute can home in on a typical ‘nanoseconds-per-day’ rate after only 60 seconds of simulation, however we stretch that out to 10 minutes to take a more sustained value, as by that time most turbo limits should be surpassed. The simulation itself works with 2 femtosecond timesteps.
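
The ‘nanoseconds-per-day’ figure follows directly from the timestep and the number of steps the hardware completes per second. The sketch below shows the arithmetic; the 50 steps per second used in the example is a hypothetical throughput for illustration, not a measured result from any system in this review.

```python
FS_PER_NS = 1_000_000     # femtoseconds in one nanosecond
SECONDS_PER_DAY = 86_400

def ns_per_day(steps_per_second, timestep_fs=2.0):
    """Simulated nanoseconds per wall-clock day at a given step rate."""
    return steps_per_second * timestep_fs * SECONDS_PER_DAY / FS_PER_NS

def steps_needed(simulated_ns, timestep_fs=2.0):
    """Number of timesteps required to cover simulated_ns of simulated time."""
    return simulated_ns * FS_PER_NS / timestep_fs

# With 2 fs timesteps, one simulated nanosecond takes half a million steps.
print(f"Steps per simulated ns: {steps_needed(1):,.0f}")

# Hypothetical throughput of 50 steps per second.
print(f"50 steps/s -> {ns_per_day(50):.2f} ns/day")
```

The half a million steps behind every simulated nanosecond is why sustained all-core frequency matters so much for this kind of workload.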

(2-5) NAMD ApoA1 Simulation

NAMD requires a lot more core-to-core communication as well as memory access, and so we reach an asymptotic limit in our test here.

DigiCortex v1.35: link

DigiCortex is a pet project for the visualization of neuron and synapse activity in the brain. The software comes with a variety of benchmark modes, and we take the small benchmark which runs a 32k neuron/1.8B synapse simulation, similar to a small slug.

The results on the output are given as a fraction of real-time simulation speed, so anything above a value of one is suitable for real-time work. The benchmark offers a ‘no firing synapse’ mode, which in essence detects DRAM and bus speed, however we take the firing mode which adds CPU work with every firing.

I reached out to the author of the software, who has added in several features to make the software conducive to benchmarking. The software comes with a series of batch files for testing, and we run the ‘small 64-bit nogui’ version with a modified command line to allow for ‘benchmark warmup’ and then perform the actual testing.

The software originally shipped with a benchmark that recorded the first few cycles and output a result. So while on fast multi-threaded processors the benchmark lasted only a few seconds, slow dual-core processors could be running for almost an hour. There is also the issue of DigiCortex starting with a base neuron/synapse map in ‘off mode’, giving a high result in the first few cycles as none of the nodes are currently active. We found that the performance settles down into a steady state after a while (when the model is actively in use), so we asked the author to allow for a ‘warm-up’ phase and for the benchmark to be the average over a second sample time.

For our test, we give the benchmark 20000 cycles to warm up and then take the data over the next 10000 cycles for the test – on a modern processor this takes 30 seconds and 150 seconds respectively. This is then repeated a minimum of 10 times, with the first three results rejected.
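As a rough sketch of how the repeated runs could be reduced to one number: the first three results are rejected as described, and the rest are combined. Taking the arithmetic mean of the remaining runs is an assumption here, as are the function name and the sample values, which are invented for illustration and are not measured results.

```python
from statistics import mean

def steady_state_score(run_results, rejected=3, min_runs=10):
    """Combine repeated benchmark runs into a single real-time factor.

    run_results: real-time factors, one per run, in execution order.
    rejected: number of leading runs to discard.
    min_runs: minimum number of runs required before scoring.
    """
    if len(run_results) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(run_results)}")
    # Assumption: the kept runs are averaged with a plain arithmetic mean.
    return mean(run_results[rejected:])

# Hypothetical values only: early runs read high before the model settles.
runs = [2.9, 2.4, 2.2, 2.0, 2.0, 2.1, 1.9, 2.0, 2.0, 2.0]
score = steady_state_score(runs)
verdict = "real-time capable" if score > 1.0 else "slower than real-time"
print(f"{score:.2f}x real-time ({verdict})")
```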

We also have an additional flag on the software to make the benchmark exit when complete (which is not default behavior). The final results are output into a predefined file, which can be parsed for the result. The number of interest for us is the ability to simulate this system in real-time, and results are given as a factor of this: hardware that can simulate double real-time is given the value of 2.0, for example.

The final result is a table that looks like this:

(3-1) DigiCortex 1.35 (32k Neuron, 1.8B Synapse)

DigiCortex is another memory access asymptotic benchmark, however with more focus on the memory and interconnect latency as well. The differences between the X64T we tested and the stock 3990X likely come down to the memory configuration.

SPEC2017rate

For a final benchmark, I want to turn to SPEC. Due to the nature of running SPEC2017 rate with all 128 threads, the run-time for this benchmark is over 16 hours, and so we’ve had to prioritize other testing to speed up the review process. We only did a single cycle of SPEC2017rate128, however we did score the following:

  • Average SPEC2017int rate 128 (estimated): 254.8
  • Average SPEC2017fp rate 128 (estimated): 234.0

The full sub-test results are in Bench. We normally showcase the results as a geomean as well, where the Magnetar X64T scores 164.1, compared to Intel’s 28-core, which scores 111.1.
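For readers unfamiliar with the geomean, here is a minimal sketch of how a geometric mean is computed, along with the ratio between the two geomean scores quoted above. The helper name is hypothetical, and the exact set of sub-test scores that feeds the published geomean is not listed here, so the function is only shown on a toy input.

```python
import math

def geomean(values):
    """Geometric mean of positive scores, computed via logs for stability."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Toy input only, not real sub-test data.
print(f"geomean of [2, 8]: {geomean([2, 8]):.3f}")

# Geomean scores quoted in the text.
x64t_score = 164.1
intel_28c_score = 111.1
advantage = x64t_score / intel_28c_score
print(f"X64T vs Intel 28-core: {advantage:.2f}x")
```

Unlike an arithmetic mean, a geomean stops a single very large sub-test score from dominating the summary, which is why it is common for suites with mixed score ranges.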

Whenever we’ve tested big processors in the past, especially those designed for the super high power consumption tasks, they’ve all come with pre-prepared test systems. For the 28-core Intel Xeon W-3175X, rated at 255W, Intel sent a whole system with a 500 W Asetek liquid cooler as well as a second liquid cooler in case we were overclocking the system. When I tested the Core i9-9990XE, a 14-core 5 GHz ‘auction-only’ processor, the system integrator shipped a full 1U server with a custom liquid cooler to deal with the 400W+ thermals.

As with the Armari Magnetar X64T, the custom liquid cooling setup is an integral part of the system offering in order to achieve the high overclocked frequencies that the company promises.

Armari calls the solution its FWLv2, ‘designed to support a fully unlocked Threadripper 3990X at maximum PBO with all 64 cores sustained up to 4.1 GHz’. The solution consists of a custom monoblock created in partnership with EKWB designed to fit both the CPU and the VRM on the ASRock motherboard specifically. There are also additional convective heatsinks for the VRM to aid in cooling, as the system also offers airflow through the chassis. This cooling loop hooks up to an EKWB Coolstream 420x45mm triple radiator, three EK-Vardar 140ER EVO fans, and a high-performance cooling pump with a custom performance profile. Armari claims a 3x better flow-rate than the best all-in-one liquid cooling solution on the market, with 200% better cooling performance and a lower noise profile (at a given power) due to the chassis design.

The system also includes two internal 140mm 1500 RPM Noctua fans, for additional airflow over the add-in components, and an angled 80mm low-noise SanAce fan mounted specifically for the memory and the VRM area of the motherboard.

As mentioned on the first page, the 420x45mm radiator is mounted on a swing arm inside the custom chassis. This makes it very easy to open the side of the case and perform maintenance. The chassis is a mix of aluminum inside and a steel frame, and weighs 18 kg / 39.7 lbs, but has handles on the top that hide inside the case, making it very easy to move but also look flush with the design. To be honest, this is a very nice chassis – it’s big, but given what it has to cool and the workstation element of it all, it is more than suitable. Externally, there are no RGB LEDs – a simple light on the top for the power/reset buttons, and blue accents at the front.

As you can probably see inside, there’s no aesthetic to pander to, especially when these systems are only meant to be opened for maintenance. The standard Armari 3-year warranty for the UK (1 year RTB, 2nd/3rd year parts + labor) includes a free full-system checkup and coolant replacement during that period.

With all that said, an overclocked 3990X is a bit of a beast, both in power consumption and cooling requirements. Armari told us going into this review that we’ll likely see a series of different power consumptions based on the workload, especially when it comes to sustained codes.

Our normal look into power consumption is typically with our y-cruncher test, which deals solely in integers:

For this test, the CPU was over 400 W from start to finish, and the peak power was 505 W. CPU temperature averaged 70 ºC and peaked at 82 ºC, with the CPU frequency averaging 4002 MHz.

For a less dense workload that involves a mixture of math, we turn to our Agisoft test. This involves converting 2D images to 3D models, and involves four algorithmic stages – some fully multi-threaded, and others that are more serially coded.

The bulk of the test was done at around 270 W, with a single peak at 375 W. CPU temperatures never broke 50ºC.

Where we saw the real power was in our 3DPMavx floating point math test. This uses AVX2 like y-cruncher, but in a much denser calculation.

So this test runs a loop for 10 seconds, then idles for 10 seconds, hence the up and down part. There are six different loops, so each one has a different effect on the power based on the instruction density. The test then repeats – I’ve cut the graph at 300 seconds to get a clear view.

The peak power is 640 W – there is no rest when the workload is this heavy, and the CPU very quickly idles down to 70 W where possible. The peak temperature with a workload this heavy, even in small 10 second bursts, was 89ºC. Depending on the exact nature of the instructions, we saw sustained all-core frequencies at 3925 MHz to 4025 MHz.

As another angle to this story, I ran a script to plot the power consumption during the AIDA64 integer stress test, and cycled through 0 threads loaded up to 256 threads loaded, with two minutes load followed by two minutes idle.

The power consumption peaks at 450 W, as with the previous integer test, and we can see a slow steady rise from 106 W with one thread loaded up to 64 threads loaded. In this instance, AIDA64 is clever enough to split threads onto separate cores. In a normal 3990X scenario, we’re going to be seeing about 3.2 W per core at full load – in this instance, we’re approaching 6 W per core. For floating point math, where we see those 640 W peaks, it’s closer to 9 W per core. Bearing in mind that some of the consumer Ryzen Zen 2 processors are similarly running 9-12 W per core at full load, this is a bit wild.
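The per-core figures above can be roughly reproduced with simple arithmetic. The sketch below is one possible way to get there, not necessarily the method used for the numbers in the text: it assumes the 70 W idle reading is a fair proxy for non-core power, and that a stock 3990X at full load sits at its 280 W TDP. Both are assumptions, as are the names used.

```python
CORES = 64
IDLE_W = 70.0  # Assumption: idle package power stands in for non-core power.

def per_core_watts(package_w: float, idle_w: float = IDLE_W, cores: int = CORES) -> float:
    """Estimate per-core power by removing an idle baseline and splitting evenly."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return (package_w - idle_w) / cores

scenarios = [
    ("stock, assumed at 280 W TDP", 280.0),
    ("overclocked, integer peak", 450.0),
    ("overclocked, floating point peak", 640.0),
]
for label, watts in scenarios:
    print(f"{label}: {per_core_watts(watts):.1f} W per core")
```

Under these assumptions the three cases land near 3.3 W, 5.9 W and 8.9 W per core, in line with the rough figures quoted.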

Now, given that 640 W for just 10 seconds already got the CPU to 89ºC, the next question is what happens to the system when that load is sustained. Depending on the use case, some software might focus on INT workloads while others prefer FP. I attached a wall power meter to the system and fired up an 8K Blender render, and left the system on for over 10 minutes.

As you can see from the video, the software starts at around 4050 MHz, and slowly decreases over time to keep everything in check to about 3850 MHz. During this time, the system averaged about 900 W at the wall, with ~935 W peak. In this time, the temperature didn’t go above 92ºC, and with an audio meter at a distance of 1 ft, I measured 45-49 dB (compared to an idle of 36 dB).

To pile even more on, I turned to software that could load up the overclocked processor as well as the Quadro RTX 6000 in the system. TheaRender is our benchmark of choice here – it’s a physically based global illumination render that can find how many samples per pixel it can calculate in any given time. It works on CPU and GPU simultaneously.

This pushed the system hard. The benchmark can take up to 20 minutes or more, and the wall meter peaked at 1167 W for the full system. This was the one and only time I heard the cooling fans kick into high gear, at 52-55 dB. Thermals on the CPU were measured at 96 ºC, which seems to be a pseudo-ceiling. Even with that said, the processor was still running at 3850 MHz all-core.

I was running the system in a 100 sq ft room, and so pushing 1000 W for a long time without adequate air flow will warm up the room. I left my benching scripts on overnight, especially the per-thread power loading, to notice the next morning the room was warm. For any users planning to run sustained workloads in a small office, take note. Even though the system is much quieter than other workstation-class systems I’ve tested, should there be an option to place the system in an air-conditioned environment external to the work desk, this might be preferable.
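To put the room-heating point in numbers: essentially all the power drawn at the wall ends up as heat in the room, so the wall readings convert directly into a heat load. The sketch below uses the wall figures from this page; the 8-hour overnight duration is a hypothetical value, and the helper names are made up for illustration.

```python
BTU_PER_HOUR_PER_WATT = 3.412142  # standard conversion factor

def heat_btu_per_hour(wall_watts: float) -> float:
    """Heat output in BTU/h for a given wall power draw."""
    return wall_watts * BTU_PER_HOUR_PER_WATT

def energy_kwh(wall_watts: float, hours: float) -> float:
    """Energy used (and released as heat) over a period, in kWh."""
    return wall_watts * hours / 1000.0

for label, watts in [("Blender average", 900.0), ("TheaRender peak", 1167.0)]:
    print(f"{label}: {watts:.0f} W is about {heat_btu_per_hour(watts):.0f} BTU/h")

# Hypothetical overnight run: 8 hours at the Blender average.
print(f"8 h at 900 W: {energy_kwh(900.0, 8.0):.1f} kWh")
```

For scale, small portable air conditioners are often rated in the range of several thousand BTU/h, so a sustained load like this can take up much of one on its own.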

Two of the most significant growth parts of my life have come around extracting as much performance out of a piece of hardware as possible. When I sat doing my PhD, in the lab, pumping out CUDA code to run through simulations in minutes instead of months, the onus was on speed – the more you could simulate in a day, the more insights you could get. As an extreme overclocker, it was pushing the silicon to its absolute limit, even if only for a few minutes, that yielded success and tasted triumph.

Today, as an editor, a technology analyst, and a journalist, the core of ‘getting work done’ is no longer contingent on the highest performance computing – it’s about how carefully I can test, how I can manage relationships with vendors and experts, and then what content I can create (at least in a written sense here at AnandTech). The only ‘performance’ aspect to my work is how many systems I can test in a given time, and that is usually more limited by space, hardware, or other projects needing attention. Despite this, that desire for fast computing has never gone away. No matter if I’m dealing with laptop responsiveness, or distributing files over the network, having access to performance makes things easier (or at least if they’re wrong, I can identify a mistake faster!).

For a number of commercial verticals that demand high performance, the nature of that performance can directly affect throughput. Whether it’s something like rapid prototyping, or 3D/visual effects, or animation rendering, or medical imaging and processing, or scientific simulations, it’s all a question of throughput and data. This is the market Armari is targeting with the Magnetar X64T.

In our testing, much as with the regular non-overclocked Threadripper 3990X, what the Magnetar X64T does well, it does *really* well. The system has been calibrated to handle integer and floating point workloads around that 4.0 GHz all-core frequency, and our thermal/audio analysis shows it to be easily more than suitable for the workstation market it is going into. The cherry on the top is in getting that SPECworkstation 3 world record, beating the high profile OEMs with a nicely built system.

Not only that, but the price is really impressive. Our system came with a Quadro RTX 6000, 256 GB of DDR4, 3TB of PCIe 4.0 storage, a custom 1600W 80PLUS Gold power supply, a custom chassis, and a three year warranty: for $14000 (pre-tax). Just the Threadripper 3990X and the Quadro RTX 6000 together are a base $8000. Add in the other hardware, the custom liquid cooling setup with a custom block and the TRX40 motherboard and 256 GB of high speed memory, with a 3 year warranty and a free checkup/coolant refill, and I suspect the big OEMs will be hard pressed to match the price. Not only that, the equivalent Intel system, using dual 28-core parts, starts easily costing $20k+ before even looking at memory or graphics.

There are some negative things to highlight, however. For a system that encourages the CPU to draw around twice the power (or more), performance gains for our tests are more in the 30-35% range. The side-effect of overclocking a CPU is that the power efficiency is lower as the processor moves further out of its ideal efficiency range. However, one might argue that matching the performance with other hardware requires multiple systems, which would draw more power. Another element is that this system is limited to 256 GB of non-ECC memory; this is an AMD limitation rather than an Armari limitation, but some of Armari’s customers will no doubt want similar performance but more memory, and probably ECC memory. And to that end, we also get to a potential performance bottleneck – having 64 cores and 128 threads working at this high speed needs a lot of memory bandwidth. Threadripper can only support 4 channels, and at DDR4-3200 that equates to ~100 GB/s (80-85 GB/s real world), leaving less than 2 GB/s per core. In a number of our tests, we saw this to be a limiting factor.
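The bandwidth and efficiency figures in this paragraph come from simple arithmetic, sketched below. It assumes the standard 64-bit (8-byte) data path per DDR4 channel and counts 1 GB as 10^9 bytes; the function names are hypothetical.

```python
CORES = 64

def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # MT/s * bytes per transfer = MB/s per channel; divide by 1000 for GB/s.
    return channels * mega_transfers_per_s * bus_bytes / 1000.0

def relative_perf_per_watt(perf_gain: float, power_ratio: float) -> float:
    """Performance per watt relative to stock (1.0 means unchanged)."""
    return (1.0 + perf_gain) / power_ratio

peak = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=3200)
print(f"Peak: {peak:.1f} GB/s, or {peak / CORES:.2f} GB/s per core")
for real_world in (80.0, 85.0):
    print(f"Real world {real_world:.0f} GB/s: {real_world / CORES:.2f} GB/s per core")

# Twice the power for a 30-35% performance gain.
for gain in (0.30, 0.35):
    print(f"+{gain:.0%} perf at 2x power: {relative_perf_per_watt(gain, 2.0):.3f}x perf/W")
```

The theoretical peak works out to 102.4 GB/s, or 1.6 GB/s per core, and a 30-35% gain at double the power means roughly two-thirds of the stock performance per watt.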

Something like AMD’s Threadripper Pro solves most of these – more memory support, ECC support, eight memory channels. However the overclocking ability would be lost, which for a system like this where the OC performance is what makes it special, removing it would be the equivalent of ripping out its soul. Ideally AMD would need a product that pairs the 8-channel + ECC support with a processor overclock.

All that being said, Armari believes it has built something that its typical customer base will love. It’s a custom super high-performing workstation with a substantial world record that you can buy, and for the visual effects studios in London that need the horsepower, AMD and Armari have it on tap.

As a small aside, I wondered how well the X64T would do in the ‘extreme’ overclocking leaderboards, where hell fears the liquid nitrogen. The best score I obtained for Cinebench R20 was 31006, which would put it 16th on the all-time leaderboard across all R20 submissions ever – the only way to get a higher score with air or liquid cooling would be to use a dual EPYC server. For Cinebench R15, a score of 12406 gives position #12 in the all-time list. This is somewhat insane for a system someone can just buy.

Here’s a couple of our Cinebench R20 runs, in under 15 seconds apiece.

The final question is how to get one (if you are interested). The Armari Magnetar X64T-RD1600G3/FWL is already available for the UK and the EU. Armari is in discussions with resellers/distributors in the US, however the warranty arrangement is slightly different. Alongside the X64T, Armari is preparing a rack-mounted 2U version of the X64T with an IPMI-enabled motherboard to come out later in Q4 – something the larger VFX houses have requested en masse. This is set for global certification, and is pending a North American distributor.