One of the most visited parts of the AnandTech website, aside from the reviews, is our benchmark database, Bench. Over the last decade we’ve placed in there as much benchmark data as we can for every sample we can get our hands on, with CPUs, GPUs, SSDs, laptops, and smartphones being our key categories. As the Senior CPU Editor here at AnandTech, one of my duties is to maintain the CPU section of Bench, making sure the benchmarks stay relevant and the newest components are tested, keeping the data as up to date as possible. Today we are announcing the start of a major Bench project, with a new benchmark suite and some very lofty goals.
What is Bench?
A number of our regular readers will know Bench. We place a link to it at the top of every page for easy access, although given the depth of content it holds, it is an understated part of AnandTech. Bench is the centralized database where we place all of the benchmark data we gather for processors, graphics, storage, tablets, laptops, and smartphones. Internally, Bench has many uses, particularly when collating review data to generate our review graphs, rather than manually redrawing full data sets for each review or keeping datasets offline.
But the biggest benefit with Bench is to either compare many products in one benchmark, or compare two products across all our benchmark tests. For example, here are the first few results of our POV-Ray test.
At its heart, Bench is a comparison tool: the ability to square two products off side by side can be vital when choosing which one to invest in. Rather than just comparing specifications, Bench provides real-world data, offering independent third-party verification of data points. In contrast to the benchmarks provided by companies invested in selling you the product, we try to create benchmarks that actually mean something, rather than just listing synthetics.
The goal of Bench has always been retrospective comparison: what the user already has versus what the user might be looking to purchase. As a result of a decade of data, that 3-5 year generational gap of benchmark information can become vital to actually quantifying how much of an upgrade a user might see from the CPU alone. It all depends on which products already have benchmark data in the database, and whether the benchmarks are relevant to the workflow (web, office, rendering, gaming, workstation tests, and so on).
Bench: The Beginning
Bench was originally started over a decade ago by AnandTech’s founder, Anand. On the CPU side of the database, he worked with both AMD and Intel to obtain a reasonable number of the latest CPUs of the day, and then spent a good summer testing them all. This was back when Core 2 and Athlons were running the market, with a number of interesting comparisons. The beauty of the Bench database is that all the data from the 30 or so processors Anand tested back then still exists today, covering the core benchmarks of interest to the industry and readership at the time.
With AMD and Intel providing the processors they did, testing every processor became a focal point for the data: it allowed users to search for their exact CPU, compare it to other models in the same family that differ on price, or compare the one they already have to a more modern component they were thinking of buying.
As the years have progressed, Bench has been updated with all the review samples we could obtain and had time to put through the benchmarks. When a new product family launches, however, we rarely get to test them all – unfortunately, official sampling rarely goes beyond one or two of the high-end products, or, if we were lucky, maybe a few more. While we’ve never been able to test full processor stacks from top to bottom, we have typically been able to cover the highlights of a product range, which has still allowed users to perform general comparisons with the data, particularly those looking to upgrade three-year-old components.
Two main factors have always inhibited the expansion of Bench.
Bench Problem #1: Actually Getting The Hardware
First, the act of sourcing the components can be a barrier to obtaining benchmark data: if we do not have the product, we cannot run the benchmarks! Intel and AMD (and VIA, back in the day) have had different structures for sampling their products, depending on how much they want to say, the release time frame, and the state of the market. Other factors can include the importance of certain processors to a company’s financials, or the level of the relationship between us and the manufacturer. Intel and AMD will only work with review websites at any depth if the analysis is fair, and our readers (that’s you) would only read the data if the analysis were unbiased as well.
When it comes down to base media sampling strategies, companies typically take one of two routes. The technology industry runs on press relations (PR), and most companies have an internal PR department while also outsourcing local PR to agencies that specialize in a given region. Depending on the product, sampling can occur either direct from the manufacturer or via the local PR team, and the sampling strategy is pre-determined at a much higher level: how many media websites are to be sampled, how many samples will be distributed to each region, and so on. For example, if a product is sampled via local PR only, there might be 3-5 units for 15+ technology media outlets, requiring that samples be moved around as they are tested. Some big launches, or outlets with a strong relationship with the manufacturer, will be managed by the company’s internal global PR team, where samples are provided in perpetuity: essentially long-term loans (which can be recalled).
For the x86 processor manufacturers, Intel and AMD are the players we work with. Of late, Intel’s official media sampling policy provides the main high-end processor in advance of the launch, such as the i7-4770K or the i7-6700K. On rare occasions, one of the parts lower down the stack is provided at the same time, or made available for sampling after the launch date. For example, with the latest Comet Lake, we were sampled both the i9-10900K and the i5-10600K; however, these are both high-impact overclockable CPUs. This typically means that if there’s an interesting processor down the stack, such as an i3-K or a low-cost Pentium, we have to work with other partners to get a sample (such as motherboard manufacturers, system integrators, or OEMs), or outright purchase it ourselves.
For AMD’s processors, as demonstrated over the last 4-5 years, the company does not often release a full family stack of CPUs at one time. Instead, processors are launched in batches, with AMD choosing to do two or three every few months. For example, AMD initially launched Ryzen with the three Ryzen 7 processors, followed by four Ryzen 5 processors a few weeks later, and finally two Ryzen 3 parts. With the past few generations from AMD, depending on how many processors are in the final stack, AnandTech is usually sampled most of them – with 1st Gen Ryzen we were sampled all of them. Previously, with the Richland and Trinity processors, only around half the stack was initially offered for review, with less chance of being sampled for the lower-value parts; some parts were offered through local PR teams a couple of months after launch. AMD still launches OEM-only parts for specific regions today – it tends not to sample those to the press either, especially if the press are not in the region for that product.
With certain processors, manufacturers target media organizations that prioritize different elements of testing, which leads to an imbalance in which media get which CPUs. Most manufacturers rate the media outlets they work with into tiers, with the top-tier outlets getting earlier sampling or more access to the components. The reason for this is that if a company sampled everyone everything every time, suddenly 5000 media outlets (and anyone who wants to start a component testing blog) would end up with 10-25 products on their doorstep every year, and it would be a mammoth task to organize (for little gain from the outlets with fewer readers).
The concept of tiering is not new – it depends on the media outlet’s readership reach, its demographic, and its ability to understand the nuance of what is in its hands. AMD and Intel can’t sample everyone everything, and sometimes they have specific markets to target, which also shifts who gets which samples. A website focused on fanless HTPCs, for example, would not be a preferred sampling vector for workstation-class processors. At AnandTech, we cover a broad range of topics, have educated readers, and have been working with Intel and AMD for twenty years. On the whole, we generally do well when it comes to processor sampling, although there are still limits – going out and asking for a stack of next-generation Xeon Gold CPUs is unlikely to be as simple as overnight shipping.
Bench Problem #2: The March of Time
The second problem with the benchmark database is timing and the benchmarks themselves. This comes down to manpower – how many people are running the benchmarks, and how long the benchmarks we test remain relevant for the segments of our readers interested in the hardware.
Take graphics card testing, for example: GPU drivers change monthly, and games are updated every few months (and the games people play also change). Keeping a healthy set of benchmark data requires retesting 5 graphics cards per GPU vendor per generation, across 4-5 generations of GPU launches, from 3-4 different board partners, on 6-10 games every month, at three different resolutions/settings per game (and testing each combination enough times to be statistically accurate). That takes time, significant effort, and manpower, and I’m amazed Ryan has been able to do so much in the little time he has as Editor-in-Chief. Picking the highest numbers out of those ranges gives 5 (GPUs) x 2 (vendors) x 5 (generations) x 4 (board partners) x 10 (games) x 3 (resolutions) x 4 (runs for statistical significance) results, which comes to 24,000 benchmark runs out of the door each month in an ideal scenario. You could be halfway through when someone issues a driver update, rendering the rest of the data moot. It’s not happening overnight, and arguably that could be work for at least one full-time employee, if not two.
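As a quick sanity check on that arithmetic, the upper bounds quoted above multiply out as follows (a throwaway sketch; the counts are simply the figures from the text, not a real test plan):

```python
# Multiply out the upper bounds of the GPU retest matrix described above.
from math import prod

matrix = {
    "GPUs per vendor": 5,
    "vendors": 2,
    "generations": 5,
    "board partners": 4,
    "games": 10,
    "resolutions/settings": 3,
    "repeats for significance": 4,
}

runs_per_month = prod(matrix.values())
print(runs_per_month)  # 24000
```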
On the CPU side of the equation, the march of time is a little slower. While the number of CPUs to test can be higher (100+ consumer parts in the last few generations), the number of degrees of freedom is smaller, and our CPU benchmark refresh cycles can be longer. These parameters depend on OS updates and drivers, like GPU testing, but it means some benchmarks can still be relevant several years later on the same operating system base. Thirty-year-old legacy Fortran code still in use is likely to stay thirty-year-old legacy Fortran code in the near future. Even benchmarks like Cinebench R15 are still quoted today, despite the Cinema4D software on which it is based being several generations newer. CPU testing ultimately ends up limited by the gaming tests, and depends on which modern GPUs are used, what games are being tested, what resolutions are relevant, and when new benchmarks enter the fray.
When Ryan retests a GPU, he has a fixed OS and system ready to go; he updates the drivers and puts the GPU back into the slot. Preparing a new CPU platform for new benchmarks means rebuilding the full system, reinstalling the OS, reinstalling the benchmark suite, and then testing. However, with the right combination of hardware and tests, a good set of data can last 18 months or so without significant updates. The danger is the full benchmark refresh, which especially revolves around moving to a newer operating system: because OS updates and scheduler changes affect the whole software stack, the old data can no longer be compared, and the full set of hardware has to be retested on the new OS with an updated benchmark suite.
With our new CPU Overload project (stylized as #CPUOverload in our article titles, because social media is cool?), the aim is to get around both of these major drawbacks.
What is #CPUOverload?
The seeds of this project were sown back in 2016. Despite having added our benchmark data to Bench for several years, I had always known our benchmark database was a popular tool, but I didn’t really realize how much it was used – or, more precisely, how under-optimized it was – until recently, when I was given access to dig around in our back-end data.
Everyone shopping for a processor wants to know how good the one they’re interested in is, and how much of a jump in performance they’ll get over their old part. Reading reviews is all well and good, but due to style and applicability, only a few processors can be directly compared in any one review; otherwise the review would be a hundred pages long. There have been many times when Ryan has asked me to scale back from 30,000 data points in a review!
It is also worth noting that reviews are rarely updated with newer processor data, as the new numbers would create a disconnect with the written analysis beneath them.
This is why Bench exists. We often link in each review to Bench and request users go there to compare other processors, or for legacy benchmarks / benchmark breakdowns that are not in the main review.
But for #CPUOverload, with the ongoing march of Windows 10 and the special features therein (such as enabling Speed Shift on Intel processors, new scheduler updates for ACPI 6.2, and a driver model that supports DX12), it was time for us to update our CPU test suite. Our recent reviews were mostly criticized for still using older hardware, namely the GTX 1080s I was able to procure, along with having some tests that didn’t always scale with the CPU. (It is worth noting that alongside sourcing CPUs for testing, sourcing GPUs is somewhat harder – asking a vendor or the GPU manufacturer for two or three or more of the same GPU without a direct review is a tough ask.) The other angle is that in any given month I get additional requests for specific CPU tests – users today would prefer to see their own workload in action for comparison, rather than general synthetics, for obvious reasons.
There is also a personal question of user experience on Bench, which has not aged well since our last website layout update in 2013.
In all, the aims of CPU Overload are:
- Source all CPUs. Focus on ones that people actually use
- Retest CPUs on Windows 10 with new CPU tests
- Retest CPUs on Windows 10 with new Gaming tests
- Update the Bench interface
For the #CPUOverload project, we are testing under Windows 10, with a variety of new tests, including AI and SPEC, with new gaming tests on the latest GPUs, and more relevant real world benchmarks. But the heart of CPU Overload is this:
We want to have every desktop CPU since 2010 tested on our new benchmarks. By my count, there are over 900.
Updating our testing suite is all well and good, but for users to find the data relevant, it has to span as many processors as possible. Using tools such as Intel’s ARK, Wikipedia, CPU-World and others, I have compiled a list of 800+ x86 processors (actually 900+ by the time this article goes live) which qualify. At the highest level, I am splitting these into four categories:
- Intel Consumer (Core i-series, HEDT)
- Intel Enterprise (Xeon, Xeon-W, 1P, 2P)
- AMD Consumer (Ryzen, FX, A-Series)
- AMD Enterprise (EPYC, Opteron)
Within both AMD and Intel, the consumer and enterprise arms of each company are discretely different business units, with product teams, and rarely is there any cross-over. The separation of the departments is easy to follow, in that ‘Consumer’ basically stands for mainstream processors that are aimed at machines a user or an OEM could build with off-the-shelf parts, and typically do not support ECC memory. ‘Enterprise’ is going to refer to processors that might end up in workstations, servers or data centers, that have professional grade features, and most of these parts do support ECC memory.
The next level of separation for the processors, for our purposes, is going to be under the heading ‘family’. Family is a term that typically groups the processors by the microarchitecture, but could also have separation based on socket or features. For CPU Overload, choosing one high-level category breaks down like this:
- AMD Consumer (360+ processors), inc Pro
- Ryzen 3000 (Threadripper, Ryzen 9/7/5/3; 7nm Zen2)
- Ryzen 2000 (Threadripper, Ryzen 7/5/3; 12nm Zen+)
- Ryzen 1000 (Threadripper, Ryzen 7/5/3; 14nm Zen)
- Bristol Ridge (A12 to Athlon X4, 28nm AM4-based Excavator v2 APUs)
- Carrizo (Athlon X4, 28nm FM2+ Excavator)
- Kaveri Refresh (A10 to Athlon X2, FM2+)
- Kaveri (A10 to Sempron X2, FM2+)
- Richland (A10 to Sempron X2, FM2+ Piledriver)
- Trinity (A10 to Sempron X2, FM2/2+ Piledriver)
- Llano (A8 to Sempron X2, 32nm FM1 K10)
- Kabini (AM1, 28nm Jaguar)
- Vishera (FX-9590 to FX-4300, 32nm AM3+ Piledriver)
- Zambezi (FX-8100 to FX-4100, 32nm AM3+ Bulldozer)
- AM3 Phenom II X6 to X4 (K10, Thuban/Zosma/Deneb)
- (Optional) Other AM3 (K10, Zosma, Deneb, Propus, Heka, etc)
- (Optional) Other AM2 (Agena, Toliman, Kuma)
Neither AMD nor Intel provides complete lists of the processors launched within a given family. Intel does it best, via its ark.intel.com (known as ARK) platform; however, sometimes there are obscure CPUs that do not make the official list due to partnerships or being geo-specific. The only way we end up knowing these obscure CPUs exist is because someone has ended up with one in their system and run diagnostic tests. Intel calls these ‘off-roadmap’ CPUs, and ARK only provides information on them if you already know the exact processor number. Scouring the various resources available online to draw up the full picture proved one thing: no list is complete. I doubt the one I have is complete either.
For example, most users believe that the last FX processor made by AMD was the massive 220W FX-9590 for the AM3+ platform. This is not the case.
AMD released two FX CPUs on FM2+, and these were only sold in HP pre-built systems: the FX-770K and the FX-670K. FX processors are typically known for being AM3+-only on the desktop; however, AMD and HP struck a deal to put the premium FX name on these other CPUs, and they were never launched at retail. In order to get them, we found a seller on eBay who had pulled them out of old systems.
In some lists we found online, it was very easy to get mixed up, because some companies have not kept their naming consistent. Take the strange case of the Athlon X4 750: despite the name not having a suffix, it is classified under a newer family than the Athlon X4 750K, with the X4 750 being Richland based and the X4 750K being Trinity based.
Then there are region specific CPUs, like the FX-8330, which was only released in China.
This is the current standing of the ‘Intel Consumer’ Core i7 and Core i5 processors up to Coffee Lake. Those marked in yellow are immediately available and ready for testing, while those marked in red are still to be obtained. The common thread is that Intel has supplied all the Core i7-K processors for most generations, and the i5-K processors for some of them. The big blocks of yellow for Kaby Lake and Skylake were also sourced from Intel. The other scattered processors are either ones I have purchased personally from my own stash, or ones we obtained via partners, such as the S/T processors from when we covered some low-power hardware a few years ago. Needless to say, there are plenty of gaps, especially on the latest (and unannounced) processors, but also further back, before I was in charge of the CPU reviews.
Some of the feedback we have had on this project is that the database technically does not need every CPU that ever existed to be relevant – and that for some CPUs, reducing the frequency multiplier would make one chip perform the same as a processor we do not have. For some CPUs that is true, as long as the testing does not fall foul of the power states, the turbo states, the points at which turbo frequencies are enabled, and the frequency/voltage curve binning (and as long as the cache sizes line up) – but it is very much a case-by-case basis. For the scope of this project, and for the data it gives to be authentic, one of the rules I am imposing is that the data has to come from testing with the actual CPU on hand – no synthetic numbers from ‘simulated’ processors.
Rule: No Synthetic Numbers from Simulated Processors
I mentioned that sourcing is one of the most difficult parts of this project, and the obvious way to get hardware is to go direct to the manufacturer and request it. Both manufacturers end up being big parts of this project regardless of their active participation, but in the best scenario, they are the hardware source.
As the four areas above (AMD/Intel, Consumer/Enterprise) are, for lack of a better description, different companies, the press contact points for the consumer and enterprise sides of each company are different. As a result, we have different relationships with each of the four, and one of the interesting barriers to sampling is rebuilding relations when long-term contacts leave. Sometimes this is for the better (sampling improves), and sometimes for the worse (a severe reluctance to offer anything).
Unfortunately, sometimes there is a wall. Business unit policies for sampling often mean only a CPU here or there can be offered, based on what’s available for media distribution, and if the company, the press contact, or the product manager does not see any business value in sampling a component (such as an Intel Pentium or an AMD A9), then it is unlikely we would get that sample from them. Part of our relationship with these companies is demonstrating the value of this data.
Another aspect is simply not having any samples – these are PR teams, not infinitely stocked store rooms. If the team we are in contact with does not have access to certain parts that we request – such as consumer-grade parts built specifically for certain OEMs that are not under the ‘consumer’ PR team’s remit, or some of the low-priority parts in a stack – they can’t loan them to us. It sounds somewhat odd that a big company like Intel or AMD wouldn’t have access to a part I’m looking for, but take those HP-only FX CPUs mentioned earlier: despite being consumer-grade CPUs, because they were supplied through a B2B transaction, they would never have passed through the hands of the PR team, and any deal with the OEM may have put reviews of the hardware solely at the OEM’s discretion. For region-only parts, only the PR team in that region will have access to them. (I eventually picked up those parts on my own dime on eBay, but this isn’t always possible.)
Nonetheless, we have approached as many people internally at both companies, as well as some OEMs and resellers, with our CPU Overload project idea. Both of the consumer arms of Intel and AMD have already provided a good first-round bounty of the latest hardware, and in most cases complete stacks of the newest generations. The enterprise hardware is a little tricky to get hold of. But many thanks to our Intel and AMD contacts that are already on-board with CPU Overload, as we try to work closer with the other units.
One thing to mention is that the newer the processor, the easier it is to source direct from the manufacturer, as these parts are typically still within their sampling quotas. However, if I ask for a Sandy Bridge Core i3-2125 from 2011, a sample is unlikely to be at anyone’s fingertips – there might be one in a drawer in a lab somewhere, but that is never a guarantee. This is where private sales, forums, and online auction sites play a role as we move further into the past. Depending on how the project goes, we may reach out to our readers (either in a project update, or on my Twitter @IanCutress) for certain parts to complete the stacks. This has already worked for at least three hard-to-find CPUs, including the HP FX CPUs (the FX-770K and FX-670K) and the Athlon X4 750 (not 750K), which we picked up from eBay and China respectively.
For the initial few months of the project, we have around 200 CPUs to begin. This breaks down into the following:
CPU Overload Project Status

| Segment | CPUs on Hand | Key Notes |
|---|---|---|
| Intel Consumer | 138 / 406 | 29 / 46 HEDT; 72 / 241 Core; 37 / 119 Pentium/Celeron |
| AMD Consumer | 137 / 366 | 42 / 105 AM4 and TR; 43 / 108 FM2+/FM1/AM1; 52 / 153 AM3/3+ |
| Intel Xeon E and E3 | 27 / 155 | 75% of E3-1200 v5; 75% of E3-1200 v4; 36% of E3-1200 v3 |
| AMD EPYC | 11 / 40 | Known socketed EPYC; lots of unknowns in cloud |
| Others | 2 / 121 | Opteron 6000 |
| Total (Phase 1) | 313 / 967 | |
For the first phase, we are almost at a good level, having around a third (313 of 967) of the processors needed. However, the models we do have are fairly localized in the Skylake/Kaby Lake-S sets, Intel’s HEDT range, and some of AMD’s stack. There is still a good number of interesting segments missing from what we have on hand.
The K10, LGA1366 Xeons and the older Opterons are part of the secondary scope of this project. Some of them are easy to obtain with bottomless pockets and a trip to eBay, and others require more research. There is a potential ‘Phase 1.5’, if we were to go after all of the Xeon E5-1600 and E5-2600 processors as well. Then it becomes an issue of tackling single socket vs dual socket systems, as well as suitable NUMA software.
So out of the two main issues with a project of this size, sourcing and benchmark longevity, we’re trying to tackle both in one go – retest everything with a new benchmark suite on the latest stable OS. The only factor left is time – retesting all these CPUs doesn’t happen overnight. To begin with, the key parts to test will be the headline Intel processors back to Sandy Bridge and the AMD parts back to FX.
For anyone that has ever had to do boring, repetitive tasks, there is always the wish that it could be done without any interaction at all. For a number of professional applications, automation can be a primary requirement – the ability to press a button and let something go, with consistency every time, removes headaches and can lead to scaling out the process.
When it comes to benchmarking, an automated test suite brings several benefits. Tests can have consistent delays between each run, providing the same environment for temperature and turbo ramps; storage and cache state can be kept consistent between runs; and it lends itself to repeatable, consistent results. Bonus points are then awarded if the testing can be scaled out to multiple systems at once. Sitting at a system making irregular, manual jumps between tests adds degrees of freedom that might not be consistent and can affect the results. Plus it becomes incredibly dull, incredibly fast – OEM product manufacturing line dull. To all my fellow reviewers out there, I know the pain when you have several hundred hours of gameplay logged on something like Far Cry 5, but it’s all just benchmarking.
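To illustrate the consistent-delay idea, a harness can enforce the same pause before every test so each one starts from a comparable thermal state. This is a minimal sketch with stand-in callables rather than real benchmark launches; none of the names come from our actual scripts:

```python
import time

COOLDOWN_SECONDS = 0  # would be on the order of 60s on real hardware; 0 here so the sketch runs instantly

def run_suite(tests):
    """Run each named test with an identical cooldown beforehand; return name -> score."""
    results = {}
    for name, test in tests.items():
        time.sleep(COOLDOWN_SECONDS)  # same pause before every run: comparable temperature/turbo ramp
        results[name] = test()
    return results

# Stand-ins for real benchmark launches (which would go via subprocess):
scores = run_suite({"render": lambda: 123.4, "compress": lambda: 56.7})
print(scores)  # {'render': 123.4, 'compress': 56.7}
```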
This is where I point to the well-known graph about automation (original source unknown):
For small tasks or projects, sometimes doing the work manually is quicker. If a 5-minute task takes 8 hours to script, and the script saves 5 seconds on each run, the script has to be run 5,760 times for the payoff; if the script is run 50 times a day, the payoff comes after 115 days. This ignores scale-out – if the script allows multiple systems to run concurrently, then for a lot of tasks the math becomes a no-brainer for putting the effort in. Otherwise, three years later, it becomes ultimately depressing to be running Cinebench for the 80,000th time. (Insert stories from TheDailyWTF about bosses who do not want automation because it might kill their jobs. Insert obligatory XKCD.)
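The break-even numbers above fall straight out of the division (pure arithmetic, nothing more):

```python
# Reproduce the automation payoff arithmetic from the text.
SCRIPT_COST_S = 8 * 3600   # 8 hours to write the script, in seconds
SAVED_PER_RUN_S = 5        # seconds saved each time the script runs
RUNS_PER_DAY = 50

breakeven_runs = SCRIPT_COST_S // SAVED_PER_RUN_S
breakeven_days = breakeven_runs / RUNS_PER_DAY
print(breakeven_runs, breakeven_days)  # 5760 115.2
```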
When I first started at AnandTech, testing motherboards, I did not run anything automated. Going through a basic motherboard testing suite manually took three days, because when testing you have to be alert and present every time a test finishes in order to run the next one (and if the mind wanders, that 2-minute test becomes 15 minutes before you realize it’s done). For our 2015 CPU benchmark suite, a basic script was written that performed about 20 tests and lasted around 4 hours. It looked like spaghetti code, and very quickly became annoying to manage and update, especially when a benchmark decided it wasn’t going to work or needed to be bypassed – and there was no easy way to add benchmarks either. On top of this, benchmark installation was manual. Insert more XKCD. Thank you, XKCD.
The new scripts for our Windows 10 testing are larger, modular, and more involved. The goal was essentially to automate everything that was feasibly possible within my knowledge (or didn’t require much learning), with no user interaction required. Over the course of two months, while testing which benchmarks were usable and applicable, two major scripts were written: CPU Tests and CPU Gaming Tests.
How to Automate: Batch Files, Powershell, and AHK
There are many ways to automate in a system. Ganesh, for example, uses PowerShell almost exclusively to call benchmarks from the command line. To say that PowerShell is a glorified command prompt doesn’t do it justice, but Ganesh ensures that his workloads for mini-PC testing can only ever run from the command line, and the results can be parsed therein.
I’m not as au fait with PowerShell (if I had time for a crash course, it’d be on my to-do list), so I use a combination of batch files and a tool called AutoHotKey (AHK for short). AHK is a simple enough scripting language which can run programs, call command line functions, call PowerShell scripts, emulate mouse movements, emulate clicks and keyboard presses, and perform internal math, with subroutine support. It is like a poor man’s C++, with an alarming number of foibles, such as poor type definition and zero type checking, but it can work if you treat it right.
For each benchmark I tested for suitability – either a fixed benchmark like Cinebench or a custom workload such as WinRAR or Blender – I tried to get the test to run from a simple batch file command line and manipulate the output. For Cinebench R15, the result is part of stderr; for Photoscan, a results file is produced via the Python script that Agisoft provided (and I’ve edited). For WinRAR, it is a timing wrapper around a command line call pointed at the workload, and for Civilization 6 it’s a simple flag after adjusting the settings file. For benchmarks like Gears Tactics or Cinebench R10, there is no command line option, and we have to turn to AHK to simulate keyboard presses.
So with each benchmark profiled, the individual tests are written as separate functions in AHK with three stages: preparation/installation, execution, and result parsing.
Preparation involves ensuring that the benchmark can be run in its current state, installing it if it isn’t, and deleting any previous temporary results files (if present) to ensure the directory structure is valid where needed. With the right preparation, running each test in the same manner keeps the results as consistent as possible. Parsing the output into something suitable usually means going through an output file with the appropriate regular expressions to pull out the required value. Some tests automatically allow for repeated results (Corona or 3DPMv2), whereas others need multiple runs specified (WinRAR); those results can be put into an array and averaged or geomeaned in AHK. A final function takes the results and files them into a custom results directory.
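The parse-and-aggregate step, translated into Python for readability (our actual implementation is AHK, and the `Score:` output format here is invented for illustration):

```python
import math
import re

def parse_score(output):
    """Pull a 'Score: <number>' value out of raw benchmark output (hypothetical format)."""
    match = re.search(r"Score:\s*([\d.]+)", output)
    if match is None:
        raise ValueError("no score found in output")  # flag broken runs rather than guessing
    return float(match.group(1))

def geomean(values):
    """Geometric mean, a common aggregate for ratio-like benchmark scores."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

repeats = [parse_score("...\nScore: " + s + "\n...") for s in ("101.2", "99.8", "100.5")]
print(round(geomean(repeats), 2))
```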
Outside of the testing functions is a general preparation element to the script. For our testing we have four main modes: the full list of tests, a short list of tests (determined in the script), running a single test, and an option to continue from a certain point of a full test run (in case one benchmark needed attention and errored out the process, such as a web benchmark when the server host fails). The initialization of one of our scripts asks which benchmark suite is required, and detects the CPU/GPU present in the system, before offering a default location to save the results based on the CPU/GPU combo. By having the results location determined when the script is started, we can move results to the directory as each test finishes, and the results are parsed into an easy-to-read format for a mental check before they go into the database. For ease of use, I have a results location on a NAS, and as the script uploads benchmark results to it, I can start looking at them while the other benchmarks are running. Useful when running to a deadline! We also do additional checks on the state of Spectre and Meltdown fixes in the OS, to ensure consistency.
Sanity Checks of Results and Running Order
Mental checks of results are important – being able to spot an outlier, or identifying when a result seems abnormal. For example, during initial testing I noticed that one of the results in one of our web tests (scoring ~100ms) was staying in the clipboard for the next web test (scoring ~700ms). This gave a much lower average for the second test – and it only happened on fast CPUs. Similarly with game tests, when the benchmark is repeated multiple times, sometimes a result (for whatever reason) might be 10% down on all the others. So either automatic detection of outliers needs to be in place (which doesn’t work if two results out of four repeats are bad), or a manual mental check needs to take place. There are a few things that automation can’t easily replace, such as experience. This is why for some tests an average might be representative, while for others a median might be more appropriate.
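The mean-vs-median point can be made concrete with a small sketch – a hedged illustration rather than our actual tooling. A median-based check reliably catches a single bad repeat, though as noted it still needs a human eye when half the repeats are off:

```python
from statistics import median

def sanity_check(runs, tolerance=0.10):
    """Flag results more than `tolerance` (default 10%) away from the
    median. The median is robust against one bad repeat in four, where
    a mean-based threshold would be dragged toward the outlier."""
    mid = median(runs)
    outliers = [r for r in runs if abs(r - mid) / mid > tolerance]
    return mid, outliers
```

With four repeats of roughly 100 and one run at 85, the 85 gets flagged while the median stays put; with two bad repeats out of four, no automatic rule of this shape can decide which pair is right, which is where experience takes over.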
Also useful to note is determining the benchmark running order. Experience with our previous automation has shown that the shortest tests should run first, in order to populate our results directory on the NAS quicker, and the longer tests should be near the end but not right at the end. The tests that more frequently cause unpredictable errors (e.g. DLL support on a new platform causing a system to hang, or a benchmark that is reliant on online license servers which could be down for maintenance) are put in last, so an overnight run will go through as many tests as possible first before tackling potential breaks in the testing.
GPU Tests and Steam
The methods listed above work for our CPU and CPU Gaming tests. The CPU Gaming tests have an additional element, given that we are using games from Steam, and we are using only one log-in account for multiple systems under test at once. For the most part, if a game title runs nicely offline, the test can be run offline. Unfortunately there are some games (GTA V, RDR2) where the benchmark script runs far more smoothly when the user is logged in, due to online DRM checks.
For this, the script I’ve written runs a test-and-lock mechanism when trying to log in to Steam, and only tries to run the online tests if the account is not already signed in elsewhere. If the account is already signed in on a different system, the first system will instead automatically run one of the offline tests and come back after that test to see if online access is available. If not, it will run another of the offline tests, check again, and so on, until there are no more offline tests to run, at which point it will sit and probe every 120 seconds for access to Steam. The machine that is online will run both sets of the online tests back-to-back, and then go back offline to run the rest of the offline tests, freeing the lock for any other machine that needs it. Some of this uses Steam’s APIs, some of it probes how Steam’s login mechanism works, and some relies on undocumented features.
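As a hypothetical sketch of that test-and-lock idea – using a shared lock file on the NAS purely for illustration, since the real script leans on Steam’s APIs and login behavior:

```python
import os
import time

# Hypothetical shared location; any path visible to all test beds works.
LOCK_PATH = r"\\NAS\bench\steam.lock"

def try_acquire_steam_lock(machine_id):
    """Atomically create the lock file; O_EXCL fails if another machine
    already holds it, so only one system ever goes online at once."""
    try:
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, machine_id.encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False

def release_steam_lock():
    os.remove(LOCK_PATH)

def run_suite(machine_id, online_tests, offline_tests):
    """Run offline tests while waiting for the lock; once held, run all
    online tests back-to-back, then release for the next machine."""
    pending = list(online_tests)
    for test in offline_tests:
        if pending and try_acquire_steam_lock(machine_id):
            for online in pending:
                online()
            pending.clear()
            release_steam_lock()
        test()
    while pending:                     # offline tests exhausted
        if try_acquire_steam_lock(machine_id):
            for online in pending:
                online()
            pending.clear()
            release_steam_lock()
        else:
            time.sleep(120)            # probe again in two minutes
```

The atomic-create trick is the whole mechanism: whichever machine wins the `os.open` call owns the online slot, and everyone else keeps themselves busy offline.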
Windows 10 Pro
Since we started using Windows 10 Pro in our last update, there’s been a large opportunity for something to come in and disrupt our testing. Windows 10 is known to kick in and check for updates at any hour of the day (and we test around the clock), so anything that can interrupt or take CPU time away from benchmarking is a hassle. There’s also the added element of Windows silently adjusting the update schedule and moving settings around the registry without warning.
While we were building this latest suite, Microsoft launched Windows 10 version 2004. There is always a question as to what we should do in this regard – move to the absolute latest, or take a step back to something more stable with fewer bugs, even if it is less current. In order not to create programming debt, whereby lots of work is needed to fix the smallest issues that might arise, we often choose the latter. In this regard, we are using Windows 10 version 1909 (18363.900). It has since transpired, from talking to peers, that 2004 has a number of issues that would affect benchmarking consistency, which validates our concerns.
Naturally, the first thing an OS wants to do when it starts up is connect to the internet and update. We install the OS without the internet connected, and our install image automatically sets the update period to the maximum period possible. The scripts we run are continuously updated to ensure that when the benchmark starts, the ‘don’t restart’ period for the OS is resynchronized to the latest possible time. There’s nothing worse than waking up in the morning to find that the system rebooted at 1am in the middle of a scripted run.
The OS is installed manually with most of the default settings, disabling all the extra monitoring features offered on install. On entering the OS, our default strategy is multi-pronged: disable the ability to update as much as possible in the registry, disable Windows Defender, uninstall OneDrive, disable Cortana as much as possible, implement the high performance mode in the power options, and stop the platform from turning off the display. We also pull the latest version of CPU-Z from network storage, in case we are testing a very new system. Another script runs when the OS loads, to check that the CPU and GPU are what we expect, and that the GPU drivers we need are in place, as Windows has a habit of updating those without saying anything. Windows Defender is disabled because, in my experience, it has historically seemed to eat CPU time for no reason if the network changes, even when the system is in use.
Some of these strategies are designed to be redundant. The goal here is to attack the required option in as many different ways as possible. There’s nothing lost by being thorough at this point and hammering the point home. This means executing registry files that adjust settings, executing batch files which do the same while installing files, and reiterating these commands before every benchmark run to be crystal clear. Simply put, do not implicitly trust Windows to leave settings alone. Something invariably changes (or moves somewhere else) if it is not monitored. Some of the commands in place are old/legacy, but are kept as they don’t otherwise adjust the system (and can take effect if options that are continually moved around suddenly move back).
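To illustrate the reiteration step, here is a hedged sketch of building the `reg add` commands that can be re-run before every benchmark pass. The keys shown are examples of the kinds of policies involved (auto-update, Cortana, Defender), not our full list, and `/f` is what makes the command overwrite whatever Windows put back:

```python
# Example policy keys only; the real scripts hit many more settings.
SETTINGS = [
    (r"HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU",
     "NoAutoUpdate", "REG_DWORD", "1"),
    (r"HKLM\SOFTWARE\Policies\Microsoft\Windows\Windows Search",
     "AllowCortana", "REG_DWORD", "0"),
    (r"HKLM\SOFTWARE\Policies\Microsoft\Windows Defender",
     "DisableAntiSpyware", "REG_DWORD", "1"),
]

def reg_commands(settings=SETTINGS):
    """Build the idempotent `reg add` command lines that get reasserted
    before each run; re-running them is harmless if nothing changed."""
    return [
        ["reg", "add", key, "/v", name, "/t", rtype, "/d", data, "/f"]
        for key, name, rtype, data in settings
    ]
```

Because every command is safe to repeat, the whole list can be fired off at OS install, at script start, and again before each benchmark, which is exactly the redundancy described above.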
It is worth noting that some of the options, when run through a batch file, require the file to be run as Administrator. Windows 10 has recently made it frustrating to do this manually without triggering user access elevation. The best way to ensure that a batch file always runs in admin mode seems to be to create a shortcut to it, and adjust the properties of the shortcut to always enable ‘run as admin’ mode. It is an interesting kludge that this works, and it is frustrating that I cannot just adjust the batch file properties directly to run as admin every time.
Benchmarks often fall under one of two headers – standalone, such that they can be run as-is, or those that need installation. Installed benchmarks are subdivided further into those with silent installers, and those that have to be installed manually.
Installing benchmarks can either be done before running the main script, or be integrated directly into the main testing script. As time has progressed, we have moved from the former to the latter, so we can wrap uninstall commands into the script if we only get limited access to a system. For the manually installed benchmarks this isn’t possible, and technically calling an install/uninstall from the script does make total testing time longer, but it also reduces SSD capacity requirements by not having everything installed at once. Experience of doing this scripting over the past few years, and making the benchmark scripts as portable as possible, has pointed to making the install/uninstall part of the benchmark run.
Benchmarks that can be run without installing, known as ‘standalone’ benchmarks, are the holy grail – Cinebench and others are great for this. The rest are probed for silent install methods. Certain benchmarks in the past, such as PCMark8, have also had additional command-line features to enable online registration for DRM. Other installers, such as .msi files, will refuse to install if they are not called from the right directory unless the correct commands are used. When scripting successive installs, it becomes important to check that the previous one has finished before another starts, otherwise the script might jump straight to the next installer before the previous one is done.
For msi files, our install code relies heavily on the following command to ensure that installs are finished before tackling the next one:
cmd /c start /wait msiexec /qb /i <file>
Most .msi files have the same flags for silent installs; however, install executables can vary significantly and require probing the vendor documentation. For the most part, ‘/S’ is the silent install flag, while some installers need /norestart to ensure the system doesn’t restart immediately, or /quiet to proceed silently. Some use none of these and rely on their own definitions of what constitutes a silent install flag. I’m looking at you, Adobe. Ultimately though, most software packages can be installed silently, perhaps with additional commands to enable licenses, and are then ready to be called for their respective tests.
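A minimal Python sketch of this install step, assuming the common flags discussed above (any given vendor’s installer may need different ones, and the paths here are placeholders):

```python
import subprocess
from pathlib import Path

def install_command(installer):
    """Build the silent-install command for one package. .msi files
    share common msiexec flags; bare installer executables mostly take
    /S, but individual vendors define their own (check their docs)."""
    installer = Path(installer)
    if installer.suffix.lower() == ".msi":
        # mirrors: cmd /c start /wait msiexec /qb /i <file>
        return ["msiexec", "/qb", "/i", str(installer)]
    return [str(installer), "/S", "/norestart"]

def install_all(installers):
    """Run installers strictly one at a time; subprocess.run() blocks
    until each process exits, so the next install never starts early."""
    for pkg in installers:
        subprocess.run(install_command(pkg), check=True)
```

Blocking on each child process is the Python equivalent of `start /wait` in the batch file: it guarantees no two installers ever overlap.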
One benchmark is a special case: Chrome. Chrome has the amazing ability to update itself as soon as it is installed – even without opening it, or whenever the system boots. Stopping this is more than just a simple software adjustment, purely because Google no longer offers an option to delay updates. We initially found an undocumented way to stop it from updating, which requires the install script to gut some of the files after installing the software, however the quick update cycle of Chrome means that our v56 version from last year is now out of date. To get around this, we are using a standalone version of Chromium.
The final benchmark in our install is Steam, which is a fully manual install. Valve has created Steam with a really odd interface interaction mechanism, even by AHK scripting standards, which makes installing it a bit of a hassle. Valve does not offer a complete standalone installer, so the base program opens after installation to download ~200MB of updates on a fresh system. We install the software over the Steam directory already present on the benchmark partition from a previous OS install, so the games do not need to be re-downloaded. (When an OS is installed, it goes on a specific OS partition, and all benchmarks are kept on a second partition.)
One other point to be aware of is when software checks for updates. Loading AIDA, for example, means it will probe online for the latest version and leave a hanging message box to be answered before a script can continue. There are often two ways to deal with this: the best is if the program allows the user to set ‘no updates’ in its configuration files. The fallback tactic that works is to disable internet connectivity (often by disabling all network adapters through PowerShell) while the application is running.
Our new CPU tests go through a number of main areas. We cover Web tests using our un-updateable version of Chromium, opening tricky PDFs, emulation, brain simulation, AI, 2D image to 3D model conversion, rendering (ray tracing, modeling), encoding (compression, AES, video and HEVC), office based tests, and our legacy tests (throwbacks from another generation of code but interesting to compare). Over the next few pages we’ll go over the high level of each test.
However, as mentioned in passing on the previous page, we run a number of registry edit commands again to ensure that various system features are turned off at the start of the benchmark suite. This includes disabling Cortana, disabling GameDVR functionality, disabling Windows Error Reporting, disabling Windows Defender as much as possible again, disabling updates, re-implementing the power options, and removing OneDrive, in case it sprouted wings again.
A number of these tests have been requested by our readers, and we’ve split them into a few more categories than normal, as readers have been asking for specific focal tests for their workloads. A recent run on a Core i5-10600K, for the CPU tests alone, took around 20 hours to complete.
- Peak Power (y-Cruncher using latest AVX)
- Per-Core Loading Power using POV-Ray
- Agisoft Photoscan 1.3: 2D to 3D Conversion
- Application Loading Time: GIMP 2.10.18 from a fresh install
- Compile Testing (WIP)
- 3D Particle Movement v2.1 (Non-AVX + AVX2/AVX512)
- y-Cruncher 0.78.9506 (Optimized Binary Splitting Compute for mathematical constants)
- NAMD 2.13: Nanoscale Molecular Dynamics on ApoA1 protein
- AI Benchmark 0.1.2 using TensorFlow (unoptimized for Windows)
- Digicortex 1.35: Brain simulation
- Dwarf Fortress 0.44.12: Fantasy world creation and time passage
- Dolphin 5.0: Ray Tracing rendering test for Wii emulator
- Blender 2.83 LTS: Popular rendering program, using PartyTug frame render
- Corona 1.3: Ray Tracing Benchmark
- Crysis CPU-Only: Can it run Crysis? What, on just the CPU at 1080p? Sure
- POV-Ray 3.7.1: Another Ray Tracing Test
- V-Ray: Another popular renderer
- CineBench R20: Cinema4D Rendering engine
- Handbrake 1.32: Popular Transcoding tool
- 7-Zip: Open source compression software
- AES Encoding: Instruction accelerated encoding
- WinRAR 5.90: Popular compression tool
- CineBench R10
- CineBench R11.5
- CineBench R15
- 3DPM v1: Naïve version of 3DPM v2.1 with no acceleration
- x264 HD 3.0: Vintage transcoding benchmark
- Kraken 1.1: Deprecated web test with no successor
- Octane 2.0: More comprehensive test (but also deprecated with no successor)
- Speedometer 2: List-based web-test with different frameworks
- Geekbench 4
- AIDA Memory Bandwidth
- Linux OpenSSL Speed (rsa2048 sign/verify, sha256, md5)
- LinX 0.9.5 LINPACK
- SPEC2006 rate-1T
- SPEC2017 rate-1T
- SPEC2017 rate-nT
It should be noted that due to the terms of the SPEC license, because our benchmark results are not vetted directly by the SPEC consortium, we have to label them as ‘estimated’. The benchmark is still run and we get results out, but those results have to have the ‘estimated’ label.
- A full x86 instruction throughput/latency analysis
- Core-to-Core Latency
- Cache-to-DRAM Latency
- Frequency Ramping
- A y-cruncher ‘sprint’ to see how 0.78.9506 scales with increasing digit compute
Some of these tests also have AIDA power wrappers around them in order to provide insight into the way power is reported through the test.
For our new set of CPU Gaming tests, we wanted to think big. There are a lot of users in the ecosystem that prioritize gaming above all else, especially when it comes to choosing the correct CPU. If there is a chance to save $50 and get a better graphics card for no loss in performance from the CPU, then this is the route that gamers would prefer to tread. The angle here though is tough – lots of games have different requirements and cause different stresses on a system, with various graphics cards having different reactions to the code flow of a game. Then users also have different resolutions and different perceptions of what feels ‘normal’. This all amounts to more degrees of freedom than we could hope to test in a lifetime, only for the data to become irrelevant in a few months when a new game or new GPU comes into the mix. Just for good measure, let us add in DirectX 12 titles that make it easier to use more CPU cores in a game to enhance fidelity.
When it comes down to gaming tests, some of the same rules apply to the CPU tests. If we can get standalone versions of tests, then perfect – even better if they will never update, because that gives us a consistent codebase to work with. However, given the nature of Steam or Origin or the EPIC Store, having a consistent code base is not always possible. So for our gaming tests, for those that we could find with offline DRM-free variants (such as those from GOG), we used those instead. Otherwise we rely on Steam for the most part, because it is the only store front that offers an external API to allow us to check if an account is online – and thus a single account to be used across multiple systems. When scaling out automation, it can be difficult when there are multiple accounts to deal with, so as we aim for fewer than 10 systems running simultaneously, one account is enough.
I could speak for a few days about the gripes of automating gaming benchmarks – the titles that do it well compared to those that show no consideration for anyone wanting to run an in-game benchmark repeatedly. There’s also the discussion of in-game benchmarks vs. native benchmarks, which I’ve had many times with colleagues and peers, and might go into in depth sometime. I have thrown benchmark titles out for the stupidest things – updates that cause *new* splash screens are why I’ve cut games like AoTS and Civ6 in the past. Or Ubisoft games that offer benchmark modes but do not output benchmark results files. Or benchmarks that create HTML files that need to be pruned for the correct data, rather than a simple text file. Or shall we go into games that keep their settings not in simple ini files, but embedded in the registry?! Total War gets thrown out for not allowing key presses in its menus, and then having cheat detection when you try to emulate mouse movements. I have, on multiple occasions, spent a day of work trying to code for a game that just doesn’t want to work – as a result, it gets thrown out of our benchmark suite.
In the past, we’ve tackled the GPU benchmark set in several different ways. We’ve had one GPU run multiple games at one resolution, then multiple GPUs run a few games at one resolution, and then, as the automation progressed into something better, multiple GPUs run a few games at several resolutions. However, based on feedback, taking the best GPU we can get hold of through over a dozen games at several resolutions seems to be the best bet.
Normally securing GPUs for this testing is difficult, as we need several identical models for concurrent testing, and very rarely is a GPU manufacturer, or one of its OEM partners, happy to hand me 3-4+ of the latest and greatest. In that aspect, over the years, I have to thank ECS for sending us four GTX 580s in 2012, MSI for sending us three GTX 770 Lightnings in 2014, Sapphire for sending us multiple RX 480s and R9 Fury X cards in 2016, and in our last test suite, MSI for sending us three GTX 1080 Gaming cards in 2018.
For our testing on the 2020 suite, we have secured three RTX 2080 Ti GPUs direct from NVIDIA. These GPUs are well optimized for in drivers and in gaming titles, and given how rare our updates are, we are thankful to get the high-end hardware. (It’s worth noting we won’t be updating to whatever RTX 3080 variant comes out for a while yet.)
On the topic of resolutions, this is something that has been hit and miss for us in the past. Some users state that they want to see the lowest resolution and lowest fidelity options, because this puts the most strain on the CPU, such as a 480p Ultra Low setting. In the past we have found this unrealistic for all use cases, and even if it does give the best shot at a difference in results, the actual point where you become GPU limited might be at a higher resolution. In our last test suite, we went from 720p Ultra Low up to 1080p Medium, 1440p High, and 4K Ultra settings. However, our most vocal readers hated it, because even by 1080p Medium we were GPU limited for the most part.
So to that end, the benchmarks this time round attempt to follow this basic pattern where possible:
1. Lowest resolution with lowest scaling, lowest settings
2. 2560×1440 with the lowest settings (1080p where not possible)
3. 3840×2160 with the lowest settings
4. 1920×1080 at the maximum settings
Point (1) should give the ultimate CPU limited scenario. We should see that lift as we move up through (2) 1440p and (3) 4K, with 4K low still being quite strenuous in some titles.
Point (4) is essentially our ‘real world’ test. The RTX 2080 Ti is overkill for 1080p Maximum, and we’ll see that most modern CPUs pull well over 60 FPS average in this scenario.
What will be interesting is that for some titles, 4K Low is more compute heavy than 1080p Maximum, and for other titles that relationship is reversed.
So we have the following benchmarks as part of our script, automated to the point of a one-button run and out pops the results approximately 10 hours later, per GPU. Also listed are the resolutions and settings used.
- Chernobylite, 360p Low, 1440p Low, 4K Low, 1080p Max
- Civilization 6, 480p Low, 1440p Low, 4K Low, 1080p Max
- Deus Ex: Mankind Divided, 600p Low, 1440p Low, 4K Low, 1080p Max
- Final Fantasy XIV, 768p Min, 1440p Min, 4K Min, 1080p Max
- Final Fantasy XV, 720p Standard, 1080p Standard, 4K Standard, 8K Standard
- World of Tanks enCore, 768p Min, 1080p Standard, 1080p Max, 4K Max
- Borderlands 3, 360p VLow, 1440p VLow, 4K VLow, 1080p Badass
- F1 2019, 768p ULow, 1440p ULow, 4K ULow, 1080p Ultra
- Far Cry 5, 720p Low, 1440p Low, 4K Low, 1080p Ultra*
- Gears Tactics, 720p Low, 4K Low, 8K Low, 1080p Ultra
- Grand Theft Auto 5, 720p Low, 1440p Low, 4K Low, 1080p Max
- Red Dead Redemption 2, 384p Min, 1440p Min, 4K Min, 1080p Max
- Strange Brigade DX12, 720p Low, 1440p Low, 4K Low, 1080p Ultra
- Strange Brigade Vulkan, 720p Low, 1440p Low, 4K Low, 1080p Ultra
For each of the games in our testing, we record the frame times where we can (the two where we cannot are Chernobylite and FFXIV). For each resolution/setting combination, we run the benchmark for as many loops as possible within a given time limit (often 10 minutes per resolution). Results are then taken as average frame rates and 95th percentiles.
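As a sketch of how recorded frame times reduce to the two reported numbers – using a simple nearest-rank percentile here, which may differ in detail from our actual tooling:

```python
from statistics import mean

def frame_stats(frame_times_ms):
    """Turn a list of per-frame times (ms) into the two reported
    numbers: average FPS and the 95th-percentile ('worst 5%') FPS."""
    avg_fps = 1000.0 / mean(frame_times_ms)
    ordered = sorted(frame_times_ms)
    # the frame time below which 95% of frames fall
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    p95_fps = 1000.0 / ordered[idx]
    return avg_fps, p95_fps
```

Note that averaging frame *times* and then inverting is not the same as averaging per-frame FPS values; the former weights each frame by how long it actually took, which is why frame times are the preferred raw data.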
Some of the games are still being evaluated for usefulness, and may eventually be dropped – Far Cry 5 has taken more time than I care to admit to get working. Some of these titles require the exact CPU/GPU combination to be part of the settings files, otherwise the settings file will be discarded, which gets increasingly frustrating.
*Update 7/20 : I recently found that Far Cry 5 has additional requirements regarding monitor resolution support. If the settings file requests a resolution that it can’t detect in the monitor on the test bed, then it defaults to 1080p. My test beds contain two brands of 4K monitor – Dell UP2415Qs and cheap 27-inch TN displays, in a 50:50 split. For whatever reason, FC5 doesn’t really like any resolution changes on the Dell monitors. I can adjust the resolution scale (0.5x-2.0x) for this game, and quality, but I only found this out on 7/20, which means we have to rerun chips for this data.
If there are any game developers out there involved with any of the benchmarks above, please get in touch at firstname.lastname@example.org. I have a list of requests to make benchmarking your title easier!
The other angle is DRM, and some titles have limits of 5 systems per day. This may limit our testing in some cases; in other cases it is solvable.
Our previous set of ‘office’ benchmarks have often been a mix of science and synthetics, so this time we wanted to keep our office section purely on real world performance.
Agisoft Photoscan 1.3.3: link
Photoscan stays in our benchmark suite from the previous benchmark scripts, but is updated to the 1.3.3 Pro version. Because it is a variable-threaded workload with many segments, features such as Speed Shift and XFR on the latest processors come into play.
The concept of Photoscan is about translating many 2D images into a 3D model – so the more detailed the images, and the more you have, the better the final 3D model in both spatial accuracy and texturing accuracy. The algorithm has four stages, with some parts of the stages being single-threaded and others multi-threaded, along with some cache/memory dependency in there as well. For some of the more variable threaded workload, features such as Speed Shift and XFR will be able to take advantage of CPU stalls or downtime, giving sizeable speedups on newer microarchitectures.
For the update to version 1.3.3, the Agisoft software now supports command line operation. Agisoft provided us with a set of new images for this version of the test, and a python script to run it. We’ve modified the script slightly by changing some quality settings for the sake of the benchmark suite length, as well as adjusting how the final timing data is recorded. The python script dumps the results file in the format of our choosing. For our test we obtain the time for each stage of the benchmark, as well as the overall time.
The final result is a table that looks like this:
The new v1.3.3 version of the software is faster than the v1.0.0 version we were previously using on the old set of benchmark images, however the newer set of benchmark images are more detailed (and a higher quantity), giving a longer benchmark overall. This is usually observed in the multi-threaded stages for the 3D mesh calculation.
Technically Agisoft has renamed Photoscan to Metashape, and is currently on version 1.6.2. We reached out to Agisoft for an updated script for the latest edition, but never heard back from our contacts. Because the scripting interface has changed, we’ve stuck with 1.3.3.
Application Opening: GIMP 2.10.18
First up is a test using a monstrous multi-layered xcf file we once received in advance of attending an event. While the file is only a single ‘image’, it has so many high-quality layers embedded it was taking north of 15 seconds to open and to gain control on the mid-range notebook I was using at the time.
For this test, we’ve upgraded from GIMP 2.10.4 to 2.10.18, but also changed the test a bit. Normally, the first time a user loads the GIMP package from a fresh install, the system has to configure a few dozen files that remain optimized on subsequent openings. For our test we delete those configured optimized files in order to force a ‘fresh load’ each time the software is run.
We measure the time taken from calling the software to be opened, and until the software hands itself back over to the OS for user control. The test is repeated for a minimum of ten minutes or at least 15 loops, whichever comes first, with the first three results discarded.
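The repeat-and-discard policy can be sketched as follows – a hedged illustration, where `run_once` stands in for the timed GIMP launch and the defaults mirror the limits described above:

```python
import time

def measure_loads(run_once, max_seconds=600, max_loops=15, discard=3):
    """Repeat a timed measurement until ten minutes have passed or 15
    loops are done, whichever comes first, then drop the first
    `discard` runs (cold caches, first-run effects) before averaging."""
    times = []
    start = time.perf_counter()
    while len(times) < max_loops and time.perf_counter() - start < max_seconds:
        t0 = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - t0)
    usable = times[discard:]
    return sum(usable) / len(usable)
```

Discarding the first few runs before averaging keeps one-off warm-up costs (driver compilation, disk cache population) out of the reported number.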
The final result is a table that looks like this:
Because GIMP is optimizing files as it starts up, the amount of work required as we increase the core count increases dramatically.
Ultimately we chose GIMP because it takes a long time to load, is free, and actually fits very nicely with our testing system. There is software out there that can take longer to start up, however I found that most of it required licences, wouldn’t allow installation across multiple systems, or that most of the delay was contacting home servers. For this test GIMP is the ultimate portable solution (however if people have suggestions, I would like to hear them).
In this version of our test suite, all the science focused tests that aren’t ‘simulation’ work are now in our science section. This includes Brownian Motion, calculating digits of Pi, molecular dynamics, and for the first time, we’re trialing an artificial intelligence benchmark, both inference and training, that works under Windows using python and TensorFlow. Where possible these benchmarks have been optimized with the latest in vector instructions, except for the AI test – we were told that while it uses Intel’s Math Kernel Libraries, they’re optimized more for Linux than for Windows, and so it gives an interesting result when unoptimized software is used.
3D Particle Movement v2.1: Non-AVX and AVX2/AVX512
This is the latest version of the benchmark designed to simulate semi-optimized scientific algorithms taken directly from my doctorate thesis. This involves randomly moving particles in a 3D space using a set of algorithms that define random movement. Version 2.1 improves over 2.0 by passing the main particle structs by reference rather than by value, and decreasing the amount of double->float->double recasts the compiler was adding in.
The initial version of v2.1 is a custom C++ binary of my own code, flags are in place to allow for multiple loops of the code with a custom benchmark length. By default this version runs six times and outputs the average score to the console, which we capture with a redirection operator that writes to file.
For v2.1, we also have a fully optimized AVX2/AVX512 version, which uses intrinsics to get the best performance out of the software. This was done by a former Intel AVX-512 engineer who now works elsewhere. According to Jim Keller, there are only a couple dozen or so people who understand how to extract the best performance out of a CPU, and this guy is one of them. To keep things honest, AMD also has a copy of the code, but has not proposed any changes.
The final result is a table that looks like this:
The 3DPM test is set to output millions of movements per second, rather than time to complete a fixed number of movements. This way the data becomes linear as performance scales, and easier to read as a result.
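To make the conversion explicit, the reported score is simply movements divided by elapsed time, scaled to millions:

```python
def rate_score(n_movements, elapsed_seconds):
    """Convert a timed run into the reported score: millions of
    particle movements per second, so bigger is linearly better."""
    return n_movements / elapsed_seconds / 1e6
```

A chip that finishes the same workload in half the time simply scores twice as high, which is what makes rate-based charts easier to read than time-to-complete charts.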
y-Cruncher 0.78.9506: www.numberworld.org/y-cruncher
If you ask anyone what sort of computer holds the world record for calculating the most digits of pi, I can guarantee that a good portion of the answers would point to some colossal supercomputer built into a mountain by a super-villain. Fortunately nothing could be further from the truth – the computer with the record is a quad-socket Ivy Bridge server with 300 TB of storage. The software that was run to get that record was y-cruncher.
Built by Alex Yee over the best part of a decade and more, y-cruncher is the software of choice for calculating billions and trillions of digits of the most popular mathematical constants. The software has held the world record for Pi since August 2010, and has broken the record a total of seven times since. It also holds records for e, the Golden Ratio, and others. According to Alex, the program is around 500,000 lines of code, and he has multiple binaries, each optimized for different families of processors, such as Zen, Ice Lake, Skylake, all the way back to Nehalem, using the latest SSE/AVX2/AVX512 instructions where they fit in, and then further optimized for how each core is built.
For our purposes, we’re calculating Pi, as it is more compute bound than memory bound. In single thread mode we calculate 250 million digits, while in multithreaded mode we go for 2.5 billion digits. That 2.5 billion digit value requires ~12 GB of DRAM, so for systems that do not have that much, we also have a separate table for slower CPUs and 250 million digits.
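The memory cutoff above can be turned into a rough rule of thumb. The bytes-per-digit ratio here is back-calculated from the article's own figures (~12 GB for 2.5 billion digits); y-cruncher's real requirement varies by algorithm and binary:

```python
def estimate_dram_gb(digits: int, bytes_per_digit: float = 4.8) -> float:
    """Rough DRAM needed for a Pi computation of `digits` digits.
    The 4.8 bytes/digit default is an assumption derived from our own
    test sizes, not a figure published by y-cruncher."""
    return digits * bytes_per_digit / 1e9

print(round(estimate_dram_gb(2_500_000_000), 1))  # 12.0 (the MT test)
print(round(estimate_dram_gb(250_000_000), 1))    # 1.2  (the ST test)
```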
y-Cruncher is also affected by memory bandwidth, even in ST mode, which is why we’re seeing the Xeons score very highly despite the lower single thread frequency.
Personally I have held a few of the records that y-Cruncher keeps track of, and my latest attempt at a record was to compute 600 billion digits of the Euler-Mascheroni constant, using a Xeon 8280 and 768 GB of DRAM. It took over 100 days (!).
NAMD 2.13 (ApoA1): Molecular Dynamics
One of the popular science fields is modelling the dynamics of proteins. By looking at how the energy of active sites within a large protein structure changes over time, scientists can calculate the required activation energies for potential interactions. This becomes very important in drug discovery. Molecular dynamics also plays a large role in protein folding, and in understanding what happens when proteins misfold, and what can be done to prevent it. Two of the most popular molecular dynamics packages in use today are NAMD and GROMACS.
NAMD, or Nanoscale Molecular Dynamics, has already been used in extensive Coronavirus research on the Frontier supercomputer. Typical simulations using the package are measured in how many nanoseconds per day can be calculated with the given hardware, and the ApoA1 protein (92,224 atoms) has been the standard model for molecular dynamics simulation.
Luckily the compute can home in on a typical ‘nanoseconds-per-day’ rate after only 60 seconds of simulation; however, we stretch that out to 10 minutes to get a more sustained value, as by that time most turbo limits should have been exceeded. The simulation itself works with 2 femtosecond timesteps.
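NAMD's log reports its benchmark rate in days/ns, which we invert and average to get the ns/day figure. A sketch of that extraction, assuming the NAMD 2.x 'Benchmark time' line format (which may vary between builds):

```python
import re

def ns_per_day(log_text: str) -> float:
    """Extract a sustained nanoseconds-per-day rate from NAMD output.
    NAMD 2.x 'Benchmark time' lines report days/ns; we invert each
    sample and average. The line format here is an assumption based on
    typical NAMD logs."""
    rates = [1.0 / float(d) for d in re.findall(r"([0-9.]+)\s+days/ns", log_text)]
    return sum(rates) / len(rates)

sample = ("Info: Benchmark time: 32 CPUs 0.05 s/step 0.5 days/ns\n"
          "Info: Benchmark time: 32 CPUs 0.05 s/step 0.4 days/ns\n")
print(round(ns_per_day(sample), 2))  # 2.25
```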
How NAMD is going to scale in our testing is going to be interesting, as the software has been developed to go across large supercomputers while taking advantage of fast communications and MPI.
AI Benchmark 0.1.2 using TensorFlow: Link
Finding an appropriate artificial intelligence benchmark for Windows has been a holy grail of mine for quite a while. The problem is that AI is such a fast-moving, fast-paced world that whatever I test this quarter may no longer be relevant in the next, and one of the key metrics in this benchmarking suite is being able to keep data over a long period of time. We’ve had AI benchmarks on smartphones for a while, given that smartphones are a better target for AI workloads, but on PC almost everything is geared towards Linux as well.
Thankfully however, the good folks over at ETH Zurich in Switzerland have converted their smartphone AI benchmark into something that’s useable in Windows. It uses TensorFlow, and for our benchmark purposes we’ve locked our testing down to TensorFlow 2.10, AI Benchmark 0.1.2, while using Python 3.7.6 – this was the only combination of versions we could get to work, because Python 3.8 has some quirks.
The benchmark runs through 19 different networks including MobileNet-V2, ResNet-V2, VGG-19 Super-Res, NVIDIA-SPADE, PSPNet, DeepLab, Pixel-RNN, and GNMT-Translation. All the tests probe both the inference and the training at various input sizes and batch sizes, except the translation that only does inference. It measures the time taken to do a given amount of work, and spits out a value at the end.
There is one big caveat for all of this, however. Speaking with the folks over at ETH, they use Intel’s Math Kernel Libraries (MKL) for Windows, and they’re seeing some incredible drawbacks. I was told that MKL for Windows doesn’t play well with multiple threads, and as a result any Windows results are going to perform a lot worse than Linux results. On top of that, after a given number of threads (~16), MKL kind of gives up and performance drops off quite substantially.
So why test it at all? Firstly, because we need an AI benchmark, and a bad one is still better than not having one at all. Secondly, if MKL on Windows is the problem, then by publicizing the test, it might just put a boot somewhere for MKL to get fixed. To that end, we’ll stay with the benchmark as long as it remains feasible.
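If one wanted to work around the scaling cliff rather than just document it, capping the worker thread count is the obvious lever. A minimal sketch; the ~16-thread cap is our empirical observation, not an MKL-documented limit:

```python
import os

def pick_ai_benchmark_threads(cap: int = 16) -> int:
    """Cap worker threads to dodge the MKL-on-Windows scaling cliff
    described above. The default cap of 16 is an assumption based on
    our own observations."""
    return min(os.cpu_count() or 1, cap)

# The chosen count could then be applied before running, e.g. via
# tf.config.threading.set_intra_op_parallelism_threads(n).
n = pick_ai_benchmark_threads()
print(n)
```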
As you can see, we’re already seeing it perform really badly with the big chips. Somewhere around the Ryzen 7 is probably where the peak is. Our Xeon chips didn’t really work at all.
Simulation and Science have a lot of overlap in the benchmarking world, however for this distinction we’re separating into two segments mostly based on the utility of the resulting data. The benchmarks that fall under Science have a distinct use for the data they output – in our Simulation section, these act more like synthetics but at some level are still trying to simulate a given environment.
DigiCortex v1.35: link
DigiCortex is a pet project for the visualization of neuron and synapse activity in the brain. The software comes with a variety of benchmark modes, and we take the small benchmark which runs a 32k neuron/1.8B synapse simulation, similar to a small slug.
The results on the output are given as a fraction of whether the system can simulate in real-time, so anything above a value of one is suitable for real-time work. The benchmark offers a ‘no firing synapse’ mode, which in essence detects DRAM and bus speed, however we take the firing mode which adds CPU work with every firing.
I reached out to the author of the software, who has added in several features to make the software conducive to benchmarking. The software comes with a series of batch files for testing, and we run the ‘small 64-bit nogui’ version with a modified command line to allow for ‘benchmark warmup’ and then perform the actual testing.
The software originally shipped with a benchmark that recorded the first few cycles and output a result. So while on fast multi-threaded processors this made the benchmark last less than a few seconds, on slow dual-core processors it could run for almost an hour. There is also the issue of DigiCortex starting with a base neuron/synapse map in ‘off mode’, giving a high result in the first few cycles as none of the nodes are currently active. We found that the performance settles down into a steady state after a while (when the model is actively in use), so we asked the author to allow for a ‘warm-up’ phase and for the benchmark to be the average over a second sample time.
For our test, we give the benchmark 20000 cycles to warm up and then take the data over the next 10000 cycles for the test – on a modern processor these take 30 seconds and 150 seconds respectively. The test is then repeated a minimum of 10 times, with the first three results rejected.
We also have an additional flag on the software to make the benchmark exit when complete (which is not default behavior). The final results are output into a predefined file, which can be parsed for the result. The number of interest for us is the ability to simulate this system in real-time, and results are given as a factor of this: hardware that can simulate double real-time is given the value of 2.0, for example.
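Parsing that predefined results file for the real-time factor looks roughly like this. The 'Realtime factor:' label is a stand-in; the actual file layout depends on the build supplied by the author:

```python
import re

def realtime_factor(results_text: str) -> float:
    """Read the real-time factor out of DigiCortex's results file.
    A value of 1.0 means the hardware simulates exactly in real-time;
    2.0 means double real-time. The label used here is hypothetical."""
    m = re.search(r"Realtime factor:\s*([0-9.]+)", results_text)
    if m is None:
        raise ValueError("no result found in output file")
    return float(m.group(1))

print(realtime_factor("... Realtime factor: 2.0 ..."))  # 2.0
```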
The final result is a table that looks like this:
The variety of results show that DigiCortex loves cache and single thread frequency, is not too fond of victim caches, but still likes threads and DRAM bandwidth.
Dwarf Fortress 0.44.12: Link
Another long standing request for our benchmark suite has been Dwarf Fortress, a popular management/roguelike indie video game, first launched in 2006 and still being regularly updated today, aiming for a Steam launch sometime in the future.
Emulating the ASCII interfaces of old, this title is a rather complex beast, which can generate environments subject to millennia of rule, famous faces, peasants, and key historical figures and events. The further you get into the game, depending on the size of the world, the slower it becomes as it has to simulate more famous people, more world events, and the natural way that humanoid creatures take over an environment. Like some kind of virus.
For our test we’re using DFMark. DFMark is a benchmark built by vorsgren on the Bay12Forums that gives two different modes built on DFHack: world generation and embark. These tests can be configured, but range anywhere from 3 minutes to several hours. After analyzing the test, we ended up going for three different world generation sizes:
- Small, a 65×65 world with 250 years, 10 civilizations and 4 megabeasts
- Medium, a 127×127 world with 550 years, 10 civilizations and 4 megabeasts
- Large, a 257×257 world with 550 years, 40 civilizations and 10 megabeasts
I looked into the embark mode, but came to the conclusion that due to the way people play embark, getting something close to real-world data would require several hours’ worth of embark tests. This would be functionally prohibitive for the bench suite, so I decided to focus on world generation.
DFMark outputs the time to run any given test, so this is what we use for the output. We loop the small test as many times as possible in 10 minutes, the medium test as many times as possible in 30 minutes, and the large test as many times as possible in an hour.
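The loop-until-budget-spent pattern is shared by several of our tests, so it is worth showing once. This is an illustrative harness, not the actual suite code:

```python
import time
from statistics import mean

def loop_for(budget_s: float, run_once) -> float:
    """Repeat a benchmark until a time budget is spent and return the
    mean time per completed run -- mirroring how we loop DFMark's
    small/medium/large tests for 10/30/60 minutes."""
    times = []
    start = time.monotonic()
    while time.monotonic() - start < budget_s:
        t0 = time.monotonic()
        run_once()                       # one full world generation
        times.append(time.monotonic() - t0)
    return mean(times)

# Usage sketch with a stand-in workload instead of DFMark itself:
avg = loop_for(0.05, lambda: sum(range(10000)))
print(avg > 0)  # True
```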
Interestingly Intel’s hardware likes Dwarf Fortress. It is primarily single threaded, and so a high IPC and a high frequency is what matters here.
Dolphin v5.0 Emulation: Link
Many emulators are often bound by single thread CPU performance, and general reports tended to suggest that Haswell provided a significant boost to emulator performance. This benchmark runs a Wii program that ray traces a complex 3D scene inside the Dolphin Wii emulator. Performance on this benchmark is a good proxy of the speed of Dolphin CPU emulation, which is an intensive single core task using most aspects of a CPU. Results are given in seconds, where the Wii itself scores 1051 seconds.
The Dolphin software has the ability to output a log, and we obtained a version of the benchmark from a Dolphin developer that outputs the display into that log file. The benchmark, when finished, will automatically try to close the Dolphin software (which is not normal behavior) and brings up a confirmation pop-up, which our benchmark script can detect and remove. The log file is fairly verbose, so the benchmark script iterates through it line-by-line looking for a regex match containing the final time to complete.
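That line-by-line scan reduces to a few lines of Python. The completion-line wording here is illustrative; our real script matches whatever the developer-supplied build writes:

```python
import re

def final_time_from_log(log_lines) -> float:
    """Scan a Dolphin log line-by-line for the completion time.
    'Benchmark complete in Xs' is a hypothetical line format; we keep
    the last match in case the phrase appears more than once."""
    result = None
    for line in log_lines:
        m = re.search(r"Benchmark complete in ([0-9.]+)s", line)
        if m:
            result = float(m.group(1))
    if result is None:
        raise ValueError("benchmark did not finish")
    return result

lines = ["Core init", "Frame 1999", "Benchmark complete in 284.5s"]
print(final_time_from_log(lines))  # 284.5
```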
The final result is a table that looks like this:
Dolphin does still have one flaw – about one in every 10 runs it will hang when the benchmark is complete and can only be removed from memory via a taskkill command or equivalent. I have not found a solution for this yet, and due to this issue Dolphin is one of the final tests in the benchmark run. If the issue occurs and I notice, I can close Dolphin and re-run the test by manually opening the benchmark in Dolphin, and allow the script to pick up the final dialog box when done.
Rendering tests, compared to others, are often a little more simple to digest and automate. All the tests put out some sort of score or time, usually in an obtainable way that makes it fairly easy to extract. These tests are some of the most strenuous on our list, due to the highly threaded nature of rendering and ray-tracing, and can draw a lot of power. If a system is not properly configured to deal with the thermal requirements of the processor, the rendering benchmarks are where it would show most easily as the frequency drops over a sustained period of time. Most benchmarks in this case are re-run several times, and the key to this is having an appropriate idle/wait time between benchmarks to allow temperatures to normalize from the last test.
Blender 2.83 LTS: Link
One of the popular tools for rendering is Blender, with it being a public open source project that anyone in the animation industry can get involved in. This extends to conferences, use in films and VR, with a dedicated Blender Institute, and everything you might expect from a professional software package (except perhaps a professional grade support package). With it being open-source, studios can customize it in as many ways as they need to get the results they require. It ends up being a big optimization target for both Intel and AMD in this regard.
For benchmarking purposes, Blender offers a benchmark suite of tests: six tests varying in complexity and difficulty, taking up to several hours of compute time to render on any system of CPUs and GPUs, even on GPUs commonly associated with rendering tools. Unfortunately what was pushed to the community wasn’t friendly for automation purposes, with there being no command line, no way to isolate one of the tests, and no way to get the data out in a sufficient manner.
To that end, we fell back to rendering a frame from a detailed project. Most reviews, as we have done in the past, focus on one of the classic Blender renders, known as BMW_27. It can take anywhere from a few minutes to almost an hour on a regular system. However now that Blender has moved onto a Long Term Support (LTS) model with the latest 2.83 release, we decided to go for something different.
We use this scene, called PartyTug at 6AM by Ian Hubert, which is the official image of Blender 2.83. It is 44.3 MB in size, and uses some of the more modern compute properties of Blender. As it is more complex than the BMW scene, but uses different aspects of the compute model, time to process is roughly similar to before. We loop the scene for 10 minutes, taking the average time of the completions. Blender offers a command-line tool for batch commands, and we redirect the output into a text file.
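The batch-mode invocation uses Blender's standard CLI flags; only the scene filename below is a placeholder:

```python
def blender_cmd(blend_file: str, frame: int = 1) -> list:
    """Build a Blender batch-mode render command. --background and
    --render-frame are standard Blender CLI flags; the filename is a
    stand-in for the actual project file."""
    return ["blender", "--background", blend_file, "--render-frame", str(frame)]

# In the harness this command is looped for 10 minutes, with stdout
# redirected to a text file and the per-frame times averaged.
print(blender_cmd("partytug.blend"))
```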
As this is a time to complete benchmark, as we strap in the big multi-core projects, those bars will shrink a lot. But there are still a couple of minutes to shave off!
Corona 1.3: Link
Corona is billed as a popular high-performance photorealistic rendering engine for 3ds Max, with development for Cinema 4D support as well. In order to promote the software, the developers produced a downloadable benchmark on the 1.3 version of the software, with a ray-traced scene involving a military vehicle and a lot of foliage. The software does multiple passes, calculating the scene, geometry, preconditioning and rendering, with performance measured in the time to finish the benchmark (the official metric used on their website) or in rays per second (the metric we use to offer a more linear scale).
The standard benchmark provided by Corona is interface driven: the scene is calculated and displayed in front of the user, with the ability to upload the result to their online database. We got in contact with the developers, who provided us with a non-interface version that allowed for command-line entry and retrieval of the results very easily. We loop around the benchmark five times, waiting 60 seconds between each, and taking an overall average. The time to run this benchmark can be around 10 minutes on a Core i9, up to over an hour on a quad-core 2014 AMD processor or dual-core Pentium.
One small caveat with this benchmark is that it needs online access to run, as the engine will only operate with a license from the licensing servers. For both the GUI and the command-line version, it does this automatically, but it does throw up an error if it can’t get a license. The good thing is that the license is valid for a week, so it doesn’t need further communications until that time runs out.
Corona for use with 3ds Max is on version 1.7, rather than the version 1.3 that the benchmark is currently on. We are told that there are some minor improvements to performance, and a newer benchmark will be produced at some point in the future. One side benefit of the benchmark being an older version is that should a budding reverse engineer actually pull out the Corona libraries to use without a license, they would be out of date.
Crysis CPU-Only Gameplay
One of the most oft-used memes in computer gaming is ‘Can It Run Crysis?’. The original 2007 game, built on CryEngine by Crytek, was heralded as a computationally complex title for the hardware at the time and for several years after, suggesting that a user needed graphics hardware from the future in order to run it. Fast forward over a decade, and the game runs fairly easily on modern GPUs.
But can we also apply the same concept to pure CPU rendering? Can a CPU, on its own, render Crysis? Since 64 core processors entered the market, one can dream. So we built a benchmark to see whether the hardware can.
For this test, we’re running Crysis’ own GPU benchmark, but in CPU render mode. This is a 2000 frame test, with low settings. Initially we planned to run the test over several resolutions, however realistically speaking only 1920×1080 matters at this point.
We’re seeing some regular consumer CPUs pull into the double digits! Unfortunately our Xeon system didn’t want to run the Crysis test at all, so it will get interesting as we move to the big AMD silicon.
POV-Ray 3.7.1: Link
A long time benchmark staple, POV-Ray is another rendering program that is well known to load up every single thread in a system, regardless of cache and memory levels. After a long period of POV-Ray 3.7 being the latest official release, when AMD launched Ryzen the POV-Ray codebase suddenly saw a range of activity from both AMD and Intel, knowing that the software (with the built-in benchmark) would be an optimization tool for the hardware.
We had to stick a flag in the sand when it came to selecting the version that was fair to both AMD and Intel, and still relevant to end-users. Version 3.7.1 fixes a significant bug in the early 2017 code, a write-after-read pattern that both the Intel and AMD optimization manuals advise against, leading to a nice performance boost.
The benchmark automation uses the BENCHMARK flag that runs the built-in multi-threaded tests and dumps the results into the clipboard. This is a full text dump, and so the actual score needs to be parsed through a quick regex check, then multiple runs can be put together to find an average. Watching the benchmark shows the result as it is being processed, however the score is an average of the processing for the last X number of seconds – the benchmark starts fast, then slows down, and speeds up towards the end, likely due to the complexity of the scene being rendered as it progresses.
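The quick regex check over that text dump looks like the following. The 'Render averaged N PPS' phrasing is an assumption about the dump's wording; the real pattern is matched to the actual build:

```python
import re
from statistics import mean

def povray_pps(clipboard_text: str) -> float:
    """Parse the pixels-per-second score out of POV-Ray's benchmark
    text dump. The exact 'Render averaged N PPS' phrasing here is
    illustrative, not guaranteed to match every build."""
    m = re.search(r"Render averaged\s+([0-9.]+)\s+PPS", clipboard_text)
    if not m:
        raise ValueError("score not found")
    return float(m.group(1))

runs = ["... Render averaged 3050.1 PPS ...",
        "... Render averaged 2998.7 PPS ..."]
print(round(mean(povray_pps(r) for r in runs), 1))  # 3024.4
```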
The benchmark can take over 20 minutes on a slow system with few cores, around a minute or two on a fast system, or seconds on a dual high-core-count EPYC. Because POV-Ray draws a large amount of power and current, it is important to make sure the cooling is sufficient here and the system stays in its high-power state. Using a motherboard with poor power delivery and low airflow could create an issue that won’t be obvious from the chart positioning if the power limit only causes a 100 MHz drop as it changes P-states.
We also use POV-Ray as our load generator in our per-core power testing. For this we take the benchmark.pov file and force it to render at an 8K resolution, which requires several minutes even on a dual socket EPYC system. Then we take the power measurement 60 seconds into the test.
V-Ray: Link
We have a couple of renderers and ray tracers in our suite already, however V-Ray’s benchmark was requested often enough for us to roll it into our suite. Built by Chaos Group, V-Ray is a 3D rendering package compatible with a number of popular commercial imaging applications, such as 3ds Max, Maya, Unreal, Cinema 4D, and Blender.
We run the standard standalone benchmark application, but in an automated fashion to pull out the result in the form of kilosamples/second. We run the test six times and take an average of the valid results.
Nothing much to say here, as it seems to scale quite well.
Cinebench R20: Link
Another common staple of a benchmark suite is Cinebench. Based on Cinema 4D, Cinebench is a purpose-built benchmark that renders a scene with both single and multi-threaded options. The scene is identical in both cases. The R20 version means that it targets Cinema 4D R20, a slightly older version of the software which is currently on R21. Cinebench R20 was launched given that the R15 version had been out for a long time, and despite the difference between the benchmark and the latest version of the software on which it is based, Cinebench results are often quoted in marketing materials.
Results for Cinebench R20 are not comparable to R15 or older, because not only is the scene being used different, but the code base has also been updated. The results are output as a score from the software, which is directly proportional to the time taken. Using the benchmark flags for single-CPU and multi-CPU workloads, we run the software from the command line, which opens the test, runs it, and dumps the result into the console, which is redirected to a text file. The test is repeated for 10 minutes for both ST and MT, and the runs averaged.
Cinebench R20 in single threaded mode is often used as a good example for IPC performance. Cinebench does not often tax the main memory or the storage, meaning that the base core design and cache structure play an important part in performance. It is one metric that Intel used to love to show its dominance over AMD, however since AMD launched Ryzen, R20 is now in AMD’s wheelhouse and Intel actively promotes it as a non-real-world benchmark.
The multi-threaded test is also somewhat DRAM and storage agnostic, showing how lots of threads can get a high result. One of the limits of R15 is that it would max out at 64 threads, then be inconsistent in performance. The R20 test is built to be a lot longer, but uses the same threading approach as before – spawn a number of worker threads equal to the CPU threads in the system, and then send batches of work to each thread rather than killing and respawning them. This means there is no overhead due to thread generation.
One of the interesting elements on modern processors is encoding performance. This covers two main areas: encryption/decryption for secure data transfer, and video transcoding from one video format to another.
In the encrypt/decrypt scenario, how data is transferred and by what mechanism is pertinent to on-the-fly encryption of sensitive data – a process by which more modern devices are leaning to for software security.
Video transcoding as a tool to adjust the quality, file size and resolution of a video file has boomed in recent years, such as providing the optimum video for devices before consumption, or for game streamers who are wanting to upload the output from their video camera in real-time. As we move into live 3D video, this task will only get more strenuous, and it turns out that the performance of certain algorithms is a function of the input/output of the content.
HandBrake 1.32: Link
Video transcoding (both encode and decode) is a hot topic in performance metrics as more and more content is being created. The first consideration is the standard in which the video is encoded, which can be lossless or lossy, trading performance for file-size, trading quality for file-size, or all of the above to increase encoding rates and help accelerate decoding rates. Alongside Google’s favorite codecs, VP9 and AV1, there are others that are prominent: H.264, the older codec, is practically everywhere and is designed to be optimized for 1080p video, while HEVC (or H.265) aims to provide the same quality as H.264 but at a lower file-size (or better quality for the same size). HEVC is important as 4K is streamed over the air, meaning fewer bits need to be transferred for the same quality content. There are other codecs coming to market designed for specific use cases all the time.
Handbrake is a favored tool for transcoding, with the later versions using copious amounts of newer APIs to take advantage of co-processors, like GPUs. It is available on Windows via an interface or can be accessed through the command-line, with the latter making our testing easier, with a redirection operator for the console output.
Finding the right combination of tests to use in our Handbrake benchmark is often difficult. There is no one test that covers all scenarios – streamers have different demands to production houses, then there’s video call transcoding that also requires some measure of CPU performance.
This time around, we’re probing a range of quality settings that seem to fit a number of scenarios. We take the compiled version of this 16-minute YouTube video about Russian CPUs at 1080p30 h264 and convert into three different files:
- 1080p30 to 480p30 ‘Discord’: x264, Max Rate 2100 kbps, High Profile 4.0, Medium Preset, 30 Peak FPS
- 1080p30 to 720p30 ‘YouTube’: x264, Max Rate 25000 kbps, High Profile 3.2, Medium Preset, 30 Peak FPS
- 1080p30 to 4K60 ‘HEVC’: H.265, Quality 24, Auto Profile, Slow Preset, 60 Peak FPS
These three presets, starting from the same source video, give a scaling set of results showing what might be best for video chat (1), streaming (2), or upconverting offline (3). We will see most CPUs can manage (1) in realtime, (2) might be a challenge, and (3) is only for the expensive systems.
This is one of our longer tests, with a good system taking between 30-60 minutes and slow systems taking several hours. The LQ test typically favors a good single thread speed due to the low amount of memory per frame, while the HEVC test can spread out to more cores and favors big and fast caches as well as good memory.
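A HandBrakeCLI invocation for the 'Discord' 480p preset can be sketched as below. The `-i`/`-o`/`-e`/`--encoder-preset`/`-l`/`-r`/`--pfr` flags are documented HandBrakeCLI options; the max-rate and profile/level settings from the list above would be applied through additional encoder options not shown here, and the filenames are placeholders:

```python
def handbrake_cmd(src: str, dst: str, height: int) -> list:
    """Partial sketch of a HandBrakeCLI command for the 480p x264
    'Discord' target. Bitrate/profile/level flags are omitted rather
    than guessed; consult the HandBrakeCLI documentation for those."""
    return ["HandBrakeCLI", "-i", src, "-o", dst,
            "-e", "x264", "--encoder-preset", "medium",
            "-l", str(height), "-r", "30", "--pfr"]

print(handbrake_cmd("src_1080p30.mp4", "discord_480p30.mp4", 480))
```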
7-Zip 1900: Link
The first compression benchmark tool we use is the open-source 7-zip, which typically offers good scaling across multiple cores. 7-zip is the compression tool most cited by readers as one they would rather see benchmarks on, and the program includes a built-in benchmark tool for both compression and decompression.
The tool can either be run from inside the software or through the command line. We take the latter route as it is easier to automate, obtain results, and put through our process. The command line flags available offer an option for repeated runs, and the output provides the average automatically through the console. We direct this output into a text file and regex the required values for compression, decompression, and a combined score.
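The `7z b` built-in benchmark prints an 'Avr:' line with the averaged figures, which our regex step picks apart. The column meanings differ between 7-Zip versions, so treat the indices below as an assumption:

```python
import re

def parse_7z_bench(console_text: str):
    """Pull the averaged numbers from `7z b` console output.
    `7z b` is 7-Zip's real built-in benchmark command; which column
    holds compression vs decompression ratings varies by version."""
    for line in console_text.splitlines():
        if line.startswith("Avr:"):
            return [int(x) for x in re.findall(r"\d+", line)]
    return None

sample = "Tot: 100 200\nAvr: 41076 38654 79730"
print(parse_7z_bench(sample))  # [41076, 38654, 79730]
```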
Ultimately this is more of a synthetic benchmark test. We perform a real-world test using WinRAR.
AES Encoding
Algorithms using AES coding have spread far and wide as a ubiquitous tool for encryption. Again, this is another CPU limited test, and modern CPUs have special AES pathways to accelerate their performance. We often see scaling in both frequency and cores with this benchmark. We use the latest version of TrueCrypt and run its benchmark mode over 1GB of in-DRAM data. Results shown are the GB/s average of encryption and decryption.
For automation, this becomes one of the most annoying tests. There is no easy way to automate it aside from loading the software and emulating keyboard button presses (with enough time between each press to ensure even the slowest system can keep up). There is no obvious end-point for the benchmark, so we have to use a fixed run time to cater for all CPUs (in this case, 240 seconds, even though the fastest CPUs can be done in under 10 seconds), and the results are listed in a table, which makes the information hard to extract. To get this information, we end up taking a localized screenshot into the clipboard and saving that file in our results directory. The final result has to be read from the image, and put into the database, manually.
At this point we’re not sure why the 3900XT doesn’t do better on this test, as we believe it should be nearer 13.5-14.0 GB/s. Needs more investigation.
Despite the TrueCrypt software having ended its development, it still leaves us with a good general test for AES (and other cryptography algorithms). The results typically scale across cores and frequency, with the newest processors having AES accelerators that push the results almost to the maximum bandwidth of the DRAM.
WinRAR 5.90: Link
For the 2020 test suite, we move to the latest version of WinRAR in our compression test. WinRAR in some quarters is more user friendly than 7-Zip, hence its inclusion. Rather than use a benchmark mode as we did with 7-Zip, here we take a set of files representative of a generic stack:
- 33 video files, each 30 seconds, totalling 1.37 GB,
- 2834 smaller website files in 370 folders in 150 MB,
- 100 Beat Saber music tracks and input files, for 451 MB
This is a mixture of compressible and incompressible formats. The results shown are the time taken to encode the files. Due to DRAM caching, we run the test 20 times and take the average of the last five runs, when the benchmark is in a steady state.
For automation, we use AHK’s internal timing tools from initiating the workload until the window closes signifying the end. This means the results are contained within AHK, with an average of the last 5 results being easy enough to calculate.
Along with single-core frequency, WinRAR benefits a lot from memory bandwidth as well as cache type. We have seen in past that the eDRAM enabled processors give a good benefit to software like WinRAR, and it is usually where we see the biggest DRAM differences.
In order to gather data to compare with older benchmarks, we are still keeping a number of tests under our ‘legacy’ section. This includes all the former major versions of CineBench (R15, R11.5, R10) as well as x264 HD 3.0 and the first very naïve version of 3DPM v2.1. We won’t be transferring the data over from the old testing into Bench, otherwise it would be populated with 200 CPUs with only one data point, so it will fill up as we test more CPUs like the others.
The other section here is our web tests.
Benchmarking using web tools is always a bit difficult. Browsers change almost daily, and the way the web is used changes even quicker. While there is some scope for advanced computational based benchmarks, most users care about responsiveness, which requires a strong back-end to work quickly to provide on the front-end. The benchmarks we chose for our web tests are essentially industry standards – at least once upon a time.
It should be noted that for each test, the browser is closed and re-opened with a fresh cache. We use a fixed Chromium version for our tests, with the update capabilities removed to ensure consistency.
Mozilla Kraken 1.1
Automation involves loading the direct webpage where the test is run and letting it complete. All CPUs finish the test in under a couple of minutes, so we use that as the end point and copy the page contents into the clipboard before parsing the result. Each run of the test on most CPUs takes from half a second to a few seconds.
We loop through the 10-run test four times (so that’s a total of 40 runs), and average the four end-results. The result is given as time to complete the test, and we’re reaching a slow asymptotic limit with regard to the highest-IPC processors.
Google Octane 2.0
It is worth noting that in the last couple of Intel generations, there was a significant uptick in performance for Intel, likely due to one of the optimizations from the code base that filtered through into the microarchitecture. Octane is still an interesting comparison point for systems within a similar microarchitecture scope.
Speedometer 2: JavaScript Frameworks
Our test goes through the list of frameworks and produces a final score indicative of ‘rpm’, one of the benchmark's internal metrics. Rather than use the main interface, we go to the admin interface through the about page and manage the results there. It involves saving the webpage when the test is complete and parsing the final result.
We repeat over the benchmark for a dozen loops, taking the average of the last five.
Most of the people in our industry have a love/hate relationship when it comes to synthetic tests. On the one hand, they’re often good for quick summaries of performance and are easy to use, but most of the time the tests aren’t related to any real software. Synthetic tests are often very good at burrowing down to a specific set of instructions and maximizing the performance out of those. Due to requests from a number of our readers, we have the following synthetic tests.
Linux OpenSSL Speed
In our last review, and on my Twitter, I opined about potential new benchmarks for our suite. One of our readers reached out and said he was interested in looking at OpenSSL hashing rates in Linux. Luckily, OpenSSL in Linux has a function called ‘speed’ that lets the user determine how fast the system is for any given hashing algorithm, as well as for signing and verifying messages.
OpenSSL offers a lot of algorithms to choose from, and based on a quick Twitter poll, we narrowed it down to the following:
- rsa2048 sign and rsa2048 verify
- sha256 at 8K block size
- md5 at 8K block size
We run each of these tests in both single-threaded and multithreaded modes.
To automate this test, Windows Subsystem for Linux is needed. For our last benchmark suite I scripted up enabling WSL with Ubuntu 18.04 on Windows in order to run SPEC, so that stays part of the suite (and actually now becomes the biggest pre-install of the suite).
OpenSSL speed has options to adjust the duration of the test; however, the way the script was managing them meant they never seemed to work properly. The ability to adjust how many threads are in play does work, though, which is important for multithreaded testing.
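As a rough illustration of the parsing step, here is a sketch in Python; the sample text is hypothetical, since the exact column layout of `openssl speed` output varies between OpenSSL versions:

```python
# Hypothetical excerpt of `openssl speed` output; the real layout differs
# between OpenSSL versions, so treat this parser as a sketch only.
sample = """\
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256          84913.22k   232204.87k   443109.71k   569365.16k   612734.29k
"""

def throughput_at_8k(text: str, algo: str) -> float:
    """Return the 8192-byte block throughput (in kB/s) for the given algorithm."""
    for line in text.splitlines():
        if line.startswith(algo):
            # The last column holds the 8192-byte figure, e.g. '612734.29k'
            return float(line.split()[-1].rstrip("k"))
    raise ValueError(f"{algo} not found in output")

print(throughput_at_8k(sample, "sha256"))  # 612734.29
```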
This test produces a lot of graphs, so for full reviews I might keep the rsa2048 ones and just leave the sha256/md5 data in Bench.
The AMD CPUs do really well in the sha256 test due to native support for SHA256 instructions.
GeekBench 4: Link
As a common tool for cross-platform testing between mobile, PC, and Mac, GeekBench is an ultimate exercise in synthetic testing across a range of algorithms looking for peak throughput. Tests include encryption, compression, fast Fourier transform, memory operations, n-body physics, matrix operations, histogram manipulation, and HTML parsing.
I’m including this test due to popular demand, although the results do come across as overly synthetic. Many users put a lot of weight behind the test because it is compiled across different platforms (albeit with different compilers). Technically GeekBench 5 exists; however, we do not have a key for the Pro version that allows for command line processing.
For reviews we are posting the overall single and multi-threaded results.
I have noticed that Geekbench 4, more so than Geekbench 5, relies a lot on its memory subtests, which could be a factor if we have to test limited-access CPUs in different systems.
AIDA64 Memory Bandwidth: Link
Speaking of memory, one of the requests we have had is to showcase memory bandwidth. Lately AIDA64 has been doing some good work in providing automation access, so for this test I used the command line and some regex to extract the data from the JSON output. AIDA also provides screenshots of its testing windows as required.
For the most part, we expect CPUs of the same family with the same memory support to not differ that much – there will be minor differences based on the exact frequency of the time, or how the power budget gets moved around, or how many cores are being fed by the memory at one time.
LinX 0.9.5 LINPACK
One of the benchmarks I’ve been after for a while is just something that outputs a very simple GFLOPs FP64 number, or in the case of AI I’d like to get a value for TOPs at a given level of quantization (FP32/FP16/INT8 etc). The most popular tool for doing this on supercomputers is a form of LINPACK, however for consumer systems it’s a case of making sure that the software is optimized for each CPU.
LinX has been a popular interface for LINPACK on Windows for a number of years. However, the last official version was 0.6.5, launched in 2015, before the latest Ryzen hardware came into being. HWTips in Korea has been updating LinX and has separated it into two versions, one for Intel and one for AMD, both of which have reached version 0.9.5. Unfortunately the AMD version is still a work in progress, as it doesn’t work on Zen 2.
There does exist a program called Linpack Extreme 1.1.3, which claims to be updated to use the latest version of the Intel Math Kernel Libraries. It works great, however the way the interface has been designed means that it can’t be automated for our uses, so we can’t use it.
For LinX 0.9.5, there is also the difficulty of what parameters to feed LINPACK. The two main parameters are problem size and time: choose a problem size too small and you won’t get peak performance; choose it too large and the calculation can go on for hours. To that end, we use the following formulas as a compromise:
- Memory Use = Floor(1000 + 20*sqrt(threads)) MB
- Time = Floor(10+sqrt(threads)) minutes
For a 4 thread system, we use 1040 MB and run for 12 minutes.
For a 128 thread system, we use 1226 MB and run for 21 minutes.
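The two formulas above can be expressed as a small helper; this is a sketch of how we pick the parameters, not any part of the LinX interface itself:

```python
from math import floor, sqrt

def linx_params(threads: int) -> tuple[int, int]:
    """Problem-size memory (MB) and run time (minutes) for a given
    thread count, following the two formulas above."""
    memory_mb = floor(1000 + 20 * sqrt(threads))
    minutes = floor(10 + sqrt(threads))
    return memory_mb, minutes

print(linx_params(4))    # (1040, 12)
print(linx_params(128))  # (1226, 21)
```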
We take the peak GFLOPs value from the output as the result. Unfortunately the output isn’t clean UTF-8, which means this is one result we have to read directly from the results file.
As we add in more CPUs, this graph should look more interesting. If a Zen2 version is deployed, we will adjust our script accordingly.
SPEC2017 and SPEC2006 are series of standardized tests used to probe the overall performance of different systems, architectures, microarchitectures, and setups. The code has to be compiled, and then the results can be submitted to an online database for comparison. The suites cover a range of integer and floating point workloads, and can be heavily optimized for each CPU, so it is important to check how the benchmarks are being compiled and run.
We run the tests in a harness built through Windows Subsystem for Linux, developed by our own Andrei Frumusanu. WSL has some odd quirks, with one test not running due to a WSL fixed stack size, but for like-for-like testing is good enough. SPEC2006 is deprecated in favor of 2017, but remains an interesting comparison point in our data. Because our scores aren’t official submissions, as per SPEC guidelines we have to declare them as internal estimates from our part.
For compilers, we use LLVM for both the C/C++ and Fortran tests; for Fortran we’re using the Flang compiler. The rationale for using LLVM over GCC is better cross-platform comparisons to platforms that only have LLVM support, plus future articles where we’ll investigate this aspect more. We’re not considering closed-source compilers such as MSVC or ICC.
clang version 8.0.0-svn350067-1~exp1+0~20181226174230.701~1.gbp6019f2 (trunk)
clang version 7.0.1 (ssh://email@example.com/flang-compiler/flang-driver.git
-mfma -mavx -mavx2
Our compiler flags are straightforward, with a basic -Ofast and the relevant ISA switches to allow for AVX2 instructions. We decided to build our SPEC binaries with AVX2, which makes Haswell the oldest generation we can go back to before the testing falls over. This also means we don’t have AVX-512 binaries, primarily because getting the best performance out of AVX-512 requires the intrinsics to be hand-packed by a proper expert, as with our AVX-512 benchmark.
To note, the requirements for the SPEC licence state that any benchmark results from SPEC have to be labelled ‘estimated’ until they are verified on the SPEC website as a meaningful representation of the expected performance. This is most often done by the big companies and OEMs to showcase performance to customers, however is quite over the top for what we do as reviewers.
For each of the SPEC targets we run (SPEC2006 rate-1, SPEC2017 speed-1, and SPEC2017 speed-N), rather than publish all the separate test data in our reviews, we are going to condense it down into individual data points. The main three will be the geometric means from each of the three suites.
A fourth metric will be a scaling metric, indicating how well the nT result scales to the 1T result for 2017, divided by the number of cores on the chip.
The per-test data will be a part of Bench.
Experienced users should be aware that 521.wrf_r, part of the SPEC2017 suite, does not work in WSL due to the fixed stack size. It is expected to work with WSL2; however, we will cross that bridge when we get to it. For now, we’re not giving wrf_r a score, which, because we are taking the geometric mean rather than the arithmetic average, should not affect the overall results too much.
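For reference, the geometric mean we quote is simply the nth root of the product of the sub-scores; a minimal sketch with hypothetical scores:

```python
from math import prod

def geomean(scores: list[float]) -> float:
    """Geometric mean: the nth root of the product of n scores."""
    return prod(scores) ** (1.0 / len(scores))

# Hypothetical per-test scores, for illustration only
scores = [10.0, 12.5, 9.0, 11.0, 8.5]
print(geomean(scores))       # summary over all five tests
print(geomean(scores[:-1]))  # summary if one test cannot run
```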
There is a class of synthetic tests that is valid: tests designed to probe how the system underneath works, rather than to measure performance. As part of our suite, these benchmarks are run to give us insight into the hardware. The data isn’t in a form we can transcribe into Bench, but it will certainly be part of reviews, showing how each microarchitecture is evolving. Sometimes these tests are called ‘microbenchmarks’, although some of our tests are more than that.
Full x86 Instruction Throughput/Latency Analysis
The full version of one of the software packages we use has a tool in order to be able to test every single x86 and x64 instruction that is in the official documentation, along with variants of those instructions. Our full instruction test goes through all of them, including x87 and the latest AVX-512, to see what works and how performant they are.
For this benchmark, we acquired a command line version. There is a secondary caveat: it requires turbo to be disabled – luckily we can do that on the command line as well.
As the core counts of modern CPUs grow, we are reaching a point where the time to access one core from another is no longer constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes can have different latencies to access the nearest core compared to the furthest core. This rings true especially in multi-socket server environments.
But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, first generation Threadripper CPUs had multiple eight-core dies on the package, with a different core-to-core latency depending on whether the access was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.
If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate to how quick an access between two cores can happen.
There is one caveat, and that’s the danger of putting too much emphasis on the comparative values. These are latency values, and in terms of performance, only particularly relevant if a workload is core-to-core latency sensitive. There are always plenty of other elements in play, such as prefetchers and buffers, which likely matter more to performance.
This is another in-house test built by Andrei, which showcases the access latency at all points in the cache hierarchy for a single core. We start at 2 KiB and probe the latency all the way through to 256 MB, which for most CPUs sits in DRAM (and before you point out that the 64-core Threadripper has 256 MB of L3, it is only 16 MB per quad-core CCX, so at 20 MB you are already in DRAM).
Part of this test helps us understand the range of latencies for accessing a given level of cache, but also the transition between the cache levels gives insight into how different parts of the cache microarchitecture work, such as TLBs. As CPU microarchitects look at interesting and novel ways to design caches upon caches inside caches, this basic test should prove to be very valuable.
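For the curious, the core idea behind such a latency probe is a dependent pointer chase, sketched below in Python. This is only an illustration of the technique: interpreter overhead dwarfs the cache effects, which is why real tools (including our in-house one) do this in native code:

```python
import random
import time

def pointer_chase_ns(n_elements: int, iters: int = 1_000_000) -> float:
    """Walk a randomly permuted cycle so that every load depends on the
    previous one, and report nanoseconds per access. A sketch of the idea
    only: Python overhead hides the cache-level differences that native
    latency tools are built to expose."""
    perm = list(range(n_elements))
    random.shuffle(perm)
    # next_idx[i] holds the next element to visit, forming one big cycle
    next_idx = [0] * n_elements
    for a, b in zip(perm, perm[1:] + perm[:1]):
        next_idx[a] = b
    i = 0
    start = time.perf_counter()
    for _ in range(iters):
        i = next_idx[i]
    return (time.perf_counter() - start) / iters * 1e9

print(pointer_chase_ns(1 << 10))  # ns per access for a small working set
```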
Both AMD and Intel have introduced features to their processors over the past few years that speed up the transition from idle into a high-power state. Users get peak performance sooner, but the biggest knock-on effect is battery life in mobile devices: if a system can turbo up quickly and turbo down quickly, it can stay in its lowest, most efficient power state for as long as possible.
Intel’s technology is called Speed Shift, while AMD’s is CPPC2.
One of the issues though with this technology is that sometimes the adjustments in frequency can be so fast, software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU is changing frequency, it essentially has to pause all compute while it aligns the frequency rate of the whole core.
We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.
We got around the issue, again due to another fabulous Andrei tool, by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies.
Our Frequency Ramp tool has already been in use in a number of reviews. Currently most Intel and AMD CPUs aim for an idle-to-turbo ramp of around 16.6 ms, which equates to a single frame on a 60 Hz display – often fast enough for most user interactions.
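A crude sketch of the measurement idea, where the sampling loop itself is the workload. Our real tool works at microsecond granularity in native code; this Python version only illustrates the principle:

```python
import time

def ramp_profile(window_s: float = 0.001, windows: int = 200):
    """Sketch of a frequency-ramp probe where the measurement loop is
    itself the workload: count iterations per fixed window and see how
    long the core takes to reach near-peak throughput from idle."""
    counts = []
    for _ in range(windows):
        n = 0
        end = time.perf_counter() + window_s
        while time.perf_counter() < end:
            n += 1
        counts.append(n)
    peak = max(counts)
    # First window that reaches 95% of peak throughput
    ramp_idx = next(i for i, c in enumerate(counts) if c >= 0.95 * peak)
    return counts, ramp_idx * window_s

counts, ramp_s = ramp_profile()
print(f"~{ramp_s * 1000:.1f} ms to reach 95% of peak throughput")
```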
A y-Cruncher Sprint
This last test is somewhat for my own edification. The y-cruncher website has a large amount of benchmark data showing how long different CPUs take to calculate pi to specific digit counts. Below these there are a few CPUs where it shows the time to compute when moving from 25 million digits to 50 million, 100 million, 250 million, and all the way up to 10 billion, showcasing how performance scales with digit count (assuming everything stays in memory).
This range of results, from 25 million to 250 billion, is something I’ve dubbed a ‘sprint’.
You might notice that not all of the cells are filled, and that is because as we move into the billions, systems need 16/32/64 GB of memory or more to even attempt the task. Moreover, there is no element of consistency in the data – it’s all from a variety of places using different memory.
In order to get a complete set of data, I have written some code to perform a sprint on every CPU we test. It detects the DRAM, works out the biggest value that can be calculated with that amount of memory, and works up from 25 million digits. For the tests that go up to ~25 billion digits, it only adds an extra 15 minutes to the suite on an 8-core Ryzen CPU.
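The sizing logic can be sketched as follows; the bytes-per-digit figure here is a rough assumption for illustration, not y-cruncher’s actual requirement:

```python
# Rough assumption for illustration only; the real memory requirement
# depends on y-cruncher's algorithm selection and settings
BYTES_PER_DIGIT = 4.7

SPRINT_SIZES = [25e6, 50e6, 100e6, 250e6, 500e6,
                1e9, 2.5e9, 5e9, 10e9, 25e9]

def sprint_plan(dram_bytes: float, headroom: float = 0.8) -> list[int]:
    """Digit counts to run: every standard size whose estimated working
    set fits within a fraction of the detected DRAM."""
    budget = dram_bytes * headroom
    return [int(d) for d in SPRINT_SIZES if d * BYTES_PER_DIGIT <= budget]

print(sprint_plan(32 * 1024**3))  # sizes that fit on a 32 GB system
```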
Helped in part by recent TV shows like Chernobyl, which recreated the events surrounding the 1986 nuclear disaster, the concepts of nuclear fallout and the town of Pripyat have been popular settings for a number of games – mostly first-person shooters. Chernobylite is an indie title that offers a science-fiction survival-horror experience and uses a 3D-scanned recreation of the real Chernobyl Exclusion Zone. It involves challenging combat and mixes free exploration with crafting and non-linear storytelling. While still in early access, it is already picking up plenty of awards.
I picked up Chernobylite while it was still in early access, and was impressed by its in-game benchmark, which showcases complex building structures among plenty of trees, where aliasing becomes important. The in-game benchmark is an on-rails experience through the scenery, covering both indoor and outdoor scenes – it ends up being very CPU-limited in the way it is designed. We have taken an offline version of Chernobylite to use in our tests, and we are testing the following settings combinations:
- 360p Low
- 1440p Low
- 4K Low
- 1080p Max
For automation purposes, the game has no flags to initiate benchmark mode. We delete the movies from the install directory to speed up entering the game, and use timers and keypresses to start the benchmark mode. The game puts out a benchmark results file, however this only shows average frame rates, not frame times. In-game settings are controlled by copying pre-arranged .ini files into the relevant location. We do as many runs within 10 minutes per resolution/setting combination, and then take averages.
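The ‘as many runs as fit in 10 minutes’ logic is common to most of our game tests, and can be sketched as follows, with a stand-in for launching the game and parsing its result:

```python
import time
from statistics import mean

def timed_runs(run_once, budget_s: float = 600.0):
    """Repeat a benchmark until the time budget expires (always finishing
    at least one pass), then average the scores. `run_once` stands in for
    launching the game and parsing its output."""
    scores = []
    deadline = time.monotonic() + budget_s
    while not scores or time.monotonic() < deadline:
        scores.append(run_once())
    return mean(scores), len(scores)

# Illustration with a stand-in workload that 'scores' a constant 60 fps
avg, n = timed_runs(lambda: 60.0, budget_s=0.01)
print(avg, n)
```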
All of our benchmark results can also be found in our benchmark engine, Bench.
Originally penned by Sid Meier and his team, the Civilization series of turn-based strategy games is a cult classic, and many an excuse for an all-nighter trying to get Gandhi to declare war on you due to an integer underflow. Truth be told, I never actually played the first version, but I have played every edition from the second to the sixth, including the fourth as voiced by the late Leonard Nimoy, and it is a game that is easy to pick up but hard to master.
Benchmarking Civilization has always been somewhat of an oxymoron – for a turn-based strategy game, the frame rate is not necessarily the important thing, and in the right mood something as low as 5 frames per second can be enough. With Civilization 6, however, Firaxis went hardcore on visual fidelity, trying to pull you into the game. As a result, Civilization can be taxing on graphics and CPUs as we crank up the details, especially in DirectX 12.
For this benchmark, we are using the following settings:
- 480p Low
- 1440p Low
- 4K Low
- 1080p Max
For automation, Firaxis supports calling the in-game benchmark from the command line, and it outputs a results file with frame times. We call the game, wait until the file is created, pull it out, and regex the relevant data. For the in-game settings, because Civ 6 seems to require the hardware from the last run to match the current run (otherwise it defaults to base settings), we delete the settings file, run the benchmark once, and then use regex on the generated settings files to set the relevant resolutions and quality settings. We do as many runs as fit within 10 minutes per resolution/setting combination, and then take averages and percentiles.
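The regex adjustment of a generated settings file can be sketched like this; the key names are hypothetical stand-ins, not Civ 6’s actual option names:

```python
import re

# Hypothetical ini-style fragment standing in for a generated settings
# file; the key names are illustrative, not the game's actual options
settings = """RenderWidth 1920
RenderHeight 1080
QualityPreset 2
"""

def set_option(text: str, key: str, value: str) -> str:
    """Replace the value of one key in place, leaving the rest untouched."""
    return re.sub(rf"^{key} .*$", f"{key} {value}", text, flags=re.M)

settings = set_option(settings, "RenderWidth", "3840")
settings = set_option(settings, "RenderHeight", "2160")
print(settings)
```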
All of our benchmark results can also be found in our benchmark engine, Bench.
Deus Ex is a franchise with a wide level of popularity. Despite the Deus Ex: Mankind Divided (DEMD) version being released in 2016, it has often been heralded as a game that taxes the CPU. It uses the Dawn Engine to create a very complex first-person action game with science-fiction based weapons and interfaces. The game combines first-person, stealth, and role-playing elements, with the game set in Prague, dealing with themes of transhumanism, conspiracy theories, and a cyberpunk future. The game allows the player to select their own path (stealth, gun-toting maniac) and offers multiple solutions to its puzzles.
DEMD has an in-game benchmark, an on-rails look around an environment showcasing some of the game’s most stunning effects, such as lighting, texturing, and others. Even in 2020, it’s still an impressive graphical showcase when everything is jumped up to the max. For this title, we are testing the following resolutions:
- 600p Low
- 1440p Low
- 4K Low
- 1080p Max
On automation, DEMD comes up a bit of a dud. Yes, there’s an in-game benchmark that gives you a result, but it doesn’t output a file and there’s no way to call the benchmark from the command line. We load the game and use button presses to select the benchmark, and to pull out the frame times we use a FRAPS wrapper. DEMD is also one of very few games that have no configuration files for resolution and quality – instead it is all in the registry, and requires adjusting the registry for every benchmark setting change. This is fairly rare, but it’s easy enough to do, and there are no issues with changing hardware.
The benchmark runs for about 90 seconds. We do as many runs within 10 minutes per resolution/setting combination, and then take averages and percentiles.
All of our benchmark results can also be found in our benchmark engine, Bench.
Despite being one number less than Final Fantasy 15, FF14 is a massively-multiplayer online title, so there are always yearly update packages which give the opportunity for graphical updates too. In 2019, FFXIV launched its Shadowbringers expansion, and an official standalone benchmark was released at the same time for users to understand what level of performance they could expect. Much like the FF15 benchmark we’ve been using for a while, this test is a long 7-minute scene of simulated gameplay within the title. There are a number of interesting graphical features, and it certainly looks more like a 2019 title than a 2010 release, which is when FF14 first came out.
With this being a standalone benchmark, we do not have to worry about updates, and the idea for these sort of tests for end-users is to keep the code base consistent. For our testing suite, we are using the following settings:
- 768p Minimum
- 1440p Minimum
- 4K Minimum
- 1080p Maximum
On the automation side of things, despite this benchmark being newer than the FF15 test, it doesn’t have as many features. There are no command line options, and key presses do not work, which means the benchmark has to be aligned with mouse movement and clicks for it to be initiated. Also, the benchmark does not automatically output a results file – there is a button on the benchmark interface to ‘save score’, which also has to be navigated to. With automation, we’re able to detect when the main benchmark window finishes in order to direct where our mouse movements go. The output unfortunately only gives us the average frame rate, and not percentiles.
Thankfully the settings file is a simple .ini file, and we can just copy over pre-built ones for each setting before we load up the benchmark. As with the other benchmarks, we do as many runs until 10 minutes per resolution/setting combination has passed, and then take averages. Realistically, because of the length of this test, this equates to two runs per setting.
All of our benchmark results can also be found in our benchmark engine, Bench.
Upon arriving to PC, Final Fantasy XV: Windows Edition was given a graphical overhaul as it was ported over from console. As a fantasy RPG with a long history, the fruits of Square-Enix’s successful partnership with NVIDIA are on display. The game uses the internal Luminous Engine, and as with other Final Fantasy games, pushes the imagination of what we can do with the hardware underneath us. To that end, FFXV was one of the first games to promote the use of ‘video game landscape photography’, due in part to the extensive detail even at long range but also with the integration of NVIDIA’s Ansel software, that allowed for super-resolution imagery and post-processing effects to be applied.
In preparation for the launch of the game, Square Enix opted to release a standalone benchmark. Using the Final Fantasy XV standalone benchmark gives us a lengthy standardized sequence to record, although it should be noted that its heavy use of NVIDIA technology means that the Maximum setting has problems – it renders items off screen. To get around this, we use the standard preset which does not have these issues. We use the following settings:
- 720p Standard
- 1080p Standard
- 4K Standard
- 8K Standard
For automation, the title accepts command line inputs for both resolution and settings, and then auto-quits when finished. This is what I consider the best type of benchmark! The output file however only deals with average frame rates, and when we started first using the benchmark, I created a FRAPs wrapper to record the six minute benchmark. As with the other benchmarks, we do as many runs until 10 minutes per resolution/setting combination has passed, and then take averages. Realistically, because of the length of this test, this equates to two runs per setting.
All of our benchmark results can also be found in our benchmark engine, Bench.
Though different from most other commonly played MMOs, or massively multiplayer online games, World of Tanks is set in the mid-20th century and allows players to take control of a range of military-based armored vehicles. World of Tanks (WoT) is developed and published by Wargaming, who are based in Belarus, with the game’s soundtrack primarily composed by Belarusian composer Sergey Khmelevsky. The game offers multiple entry points, including a free-to-play element, as well as allowing players to pay a fee to open up more features. One of the most interesting things about this tank-based MMO is that it achieved eSports status when it debuted at the World Cyber Games back in 2012.
World of Tanks enCore is a demo application for the game’s new graphics engine, penned by the Wargaming development team. Over time the new core engine has been implemented into the full game, upgrading the game’s visuals with key elements such as improved water, flora, shadows, and lighting, as well as other objects such as buildings. The enCore demo app not only offers insight into the impending engine changes, but also allows users to check system performance to see if the new engine runs optimally on their system. There is technically a ray tracing version of the enCore benchmark now available; however, because it can’t be deployed standalone without the installer, we decided against using it. If that gets fixed, we can look into it.
The benchmark tool comes with a number of presets:
- 768p Minimum
- 1080p Standard
- 1080p Max
- 4K Max (not a preset)
The WoT enCore tool does not have any command line options and does not accept key presses, so we have to automate mouse movements to select the relevant resolutions and settings before starting the benchmark. The odd one out is the 4K Max preset, because the benchmark doesn’t natively offer a 4K option – to get it, part of our script edits the acceptable-resolutions ini file, after which we can select 4K.
The benchmark outputs its own results file, with frame times, making it very easy to parse the data needed for average and percentiles.
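Deriving an average and a percentile figure from a frame-time log can be sketched as follows, with hypothetical frame times:

```python
def fps_stats(frame_times_ms: list[float]) -> tuple[float, float]:
    """Average FPS and 95th-percentile frame time from per-frame times
    in milliseconds (nearest-rank percentile for simplicity)."""
    total_s = sum(frame_times_ms) / 1000.0
    avg_fps = len(frame_times_ms) / total_s
    ordered = sorted(frame_times_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return avg_fps, p95

# Hypothetical log: mostly ~60 fps frames with a few slow ones
times = [16.7] * 90 + [33.3] * 10
avg_fps, p95 = fps_stats(times)
print(round(avg_fps, 1), p95)  # 54.5 33.3
```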
All of our benchmark results can also be found in our benchmark engine, Bench.
As a big Borderlands fan, having to sit and wait six months for the Epic Store exclusivity to expire before we saw it on Steam felt like a long time. The fourth title of the franchise, if you exclude the Telltale-style games, BL3 expands the universe beyond Pandora and its orbit, with the set of heroes (plus those from previous games) now cruising the galaxy looking for vaults and the treasures within. Popular characters like Tiny Tina, Claptrap, Lilith, Dr. Zed, Zer0, Tannis, and others all make appearances as the game continues its cel-shaded design, but with the graphical fidelity turned up. Borderlands 1 gave me my first ever taste of proper in-game second-order PhysX, and it’s a high standard that continues to this day.
BL3 works best with online access, so it is filed under our online games section. This means we have to manage access to the same Steam account with a variety of API calls to make sure no two systems try to go online at the same time. BL3 is also one of our biggest downloads, requiring 100+ GB. Unfortunately Valve has changed the way downloads work, so ideas like operating a local on-network Steam cache for quick downloads no longer work (not only that, but BL3 seems to download from 2K, not Steam, making most of it uncacheable).
As BL3 supports resolution scaling, we are using the following settings:
- 360p Very Low
- 1440p Very Low
- 4K Very Low
- 1080p Badass
BL3 has its own in-game benchmark, which recreates a set of on-rails scenes with a variety of activity going on in each, such as shootouts, explosions, and wildlife. The benchmark has a curious element to it, in that if it detects that the previous run was done on different hardware, it resorts to default settings. In order to get around this, we adjusted our automation to delete the previous settings file, load up a default settings file, and use regex to adjust the settings required.
Unfortunately there is no command line option to initiate the benchmark, so we manipulate pauses and key presses in order to select the benchmark. The benchmark outputs its own results files, including frame times, which can be parsed for our averages/percentile data.
All of our benchmark results can also be found in our benchmark engine, Bench.
The F1 racing games from Codemasters have been popular benchmarks in the tech community, mostly for their ease of use and because they seem to take advantage of any area of a machine that might be better than another. The 2019 edition of the game features all 21 circuits on the calendar, and includes a range of retro models and DLC focusing on the careers of Alain Prost and Ayrton Senna. Built on the EGO Engine 3.0, the game has been criticized, like most annual sports games, for not offering enough season-to-season graphical fidelity updates to make investing in the latest title worthwhile; however, the 2019 edition revamps the Career mode, with features such as in-season driver swaps coming into the mix. The quality of the graphics this time around is also superb, even at 4K low or 1080p Ultra.
To be honest, F1 benchmarking has been up and down in any given year. Since at least 2014, the benchmark has revolved around a ‘test file’, which allows you to set what track you want, which driver to control, what weather you want, and which cars are in the field. In previous years I’ve always enjoyed putting the benchmark in the wet at Spa-Francorchamps, starting the fastest car at the back with a field of 19 Vitantonio Liuzzis on a 2-lap race and watching sparks fly. In some years, the test file hasn’t worked properly, with the track not being able to be changed.
In the 2019 benchmark, the test file allows you to select a driver, a track, a starting position, the number of laps, and the weather. It doesn’t allow you to set the grid how you see fit (perhaps due to sponsorship reasons?), and one bugbear with this version is that they seem to have set the AI aggression in the benchmark to zero, or worse – there’s never any overtaking. Perhaps because in previous years there was a chance that you could be crashed into, and the benchmark would stall? I never had that problem.
For our test, we put Alex Albon in the Red Bull in position #20, for a dry two-lap race around Austin. We test at the following settings:
- 768p Ultra Low
- 1440p Ultra Low
- 4K Ultra Low
- 1080p Ultra
In terms of automation, F1 2019 has an in-game benchmark that can be called from the command line, and the output file has frame times. That part is perfect. The problem is with the settings file – it requires the CPU and GPU listed to be the exact models in the correct string format, or else it throws everything out and reverts to defaults. This is especially tricky if the CPU string contains extra whitespace that isn’t immediately noticeable.
In order to ensure we have the correct CPU/GPU, we delete the settings file and load up the game so it creates its own. We then regex the created settings file with the graphics settings we want for the test. Unlike Civilization 6, where you can choose individual presets in the settings file, every specific graphical setting needs to be adjusted – this leads to about 25 regex commands for the different parts that need adjusting.
Nonetheless, it’s all doable, and the benchmark works well. We have seen issues with CPUs that do not support AVX, perhaps indicating that F1 2019 can’t be run properly on Celerons or Pentiums.
All of our benchmark results can also be found in our benchmark engine, Bench.
The fifth title in Ubisoft’s Far Cry series lands us right into the unwelcoming arms of an armed militant cult in Montana, one of the many middles-of-nowhere in the United States. With a charismatic and enigmatic adversary, gorgeous landscapes of the northwestern American flavor, and lots of violence, it is classic Far Cry fare. Graphically intensive in an open-world environment, the game mixes in action and exploration with a lot of configurability.
Far Cry 5 is an Ubisoft game, which muddies the waters a bit. On a positive note, Ubisoft has made using UPlay with automation a lot easier in recent months, so it is no longer an issue even when there is a UPlay update. Ubisoft is also a big fan of in-game benchmarks, which for Far Cry 5 is an on-rails recreation of in-game events showcasing close-up and far viewing environments, plus explosions and gun fights.
FC5 has a massive issue when it comes to what monitor is attached, and we’ve lost extensive amounts of hair dealing with the different resolutions the game detects on each monitor. In my home office, I have two brands of 4K monitor for our testing: Dell UP2415Qs and cheap £200 27-inch TN panels. Unfortunately, the game doesn’t like us changing the resolution in the settings file when using the Dell monitors, reverting to 1080p but keeping the quality settings and, surprisingly, the resolution scaling. In order to make the game more palatable across every monitor, on 7/20 (the day of this article) we decided to fix the resolution at 1080p and use a variety of different scaling factors to give the following:
- 720p Low (0.5x scaling)
- 1440p Low (1.3x scaling)
- 4K Low (2.0x scaling)
- 1440p Max (1.0x scaling)
Automating the game has been tough. Aside from the resolution issue, which has really only come about on the day I’m writing this, this game has the same two downsides as every Ubisoft in-game benchmark: no easy entry, and a disaster of a results file (if it outputs one at all). On the first point, like many of the games here, there is no simple command line to start the benchmark, and we have to resort to loading the game and manipulating key presses to get the benchmark started. The second point, the disaster of a results file, is essentially hell.
It is a positive that Ubisoft outputs a file here. The negative is that the file is an HTML file, which showcases the average FPS and a graph of the FPS detected. At no point does the HTML file contain the frame times for each frame, but it does show the frames per second, as one value per second in the graph. The graph in HTML form is a series of (x,y) co-ordinates scaled to the min/max of the graph, rather than the raw (second, FPS) data, and so using regex I carefully tease out the values of the graph, convert them into a (second, FPS) format, and take our averages and percentiles that way. Technically, because we only have FPS data for every second of the test, this isn’t the true percentile, but it is the best approximation we have.
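The recovery step looks roughly like this. Everything here is a toy stand-in – the real HTML markup, the plot extents, and the axis min/max values differ – but it shows the idea of inverting the graph scaling and approximating a percentile from one-sample-per-second data:

```python
import re

# Toy stand-in for the FPS graph embedded in the HTML results file:
# coordinates are scaled to the plot area, not raw (second, FPS).
html = '<polyline points="0,100 10,80 20,90 30,60 40,100" />'
y_min_px, y_max_px = 0, 100      # pixel extent of the plot (assumed)
fps_min, fps_max = 30, 90        # FPS axis labels read from the page (assumed)

points = re.findall(r'(\d+),(\d+)', html)

# Invert the scaling: the pixel y-axis runs downwards, so the top of
# the plot is the maximum FPS and the bottom the minimum.
fps = []
for second, (_, y_px) in enumerate(points):
    frac = (y_max_px - int(y_px)) / (y_max_px - y_min_px)
    fps.append(fps_min + frac * (fps_max - fps_min))

average = sum(fps) / len(fps)

# With only one sample per second, this is an approximate percentile,
# not a true per-frame one.
lows = sorted(fps)
p95_low = lows[int(0.05 * len(lows))]
print(average, p95_low)
```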
Unfortunately, because we’ve changed our Far Cry 5 setup today to ensure consistency, we have no current data at this time.
If anyone from Ubisoft wants to chat about building a benchmark platform that would help not only me but every other member of the tech press build out benchmark testing to help our readers decide what the best hardware is for their games, please reach out to firstname.lastname@example.org. Some of the suggestions I want to give would take less than half a day to implement, and it’s effectively free advertising every time the benchmark is used over the next couple of years (or more).
All of our benchmark results can also be found in our benchmark engine, Bench.
Remembering the original Gears of War brings back a number of memories – some good, some involving online gameplay. The latest iteration of the franchise was launched as I was putting this benchmark suite together, and Gears Tactics is a high-fidelity turn-based strategy game with an extensive single player mode. As with a lot of turn-based games, there is ample opportunity to crank up the visual effects, and here the developers have put a lot of effort into creating effects, a number of which seem to be CPU limited.
Gears Tactics has an in-game benchmark, roughly 2.5 minutes of AI gameplay starting from the same position but doing slightly different things each time. Much like the racing games, this usually leads to some variation in the run-to-run data, so for this benchmark we take the geometric mean of the results. One of the biggest things Gears Tactics offers is resolution scaling, supporting up to 8K, and so we are testing the following settings:
- 720p Low
- 4K Low
- 8K Low
- 1080p Ultra
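Combining the variable runs is straightforward; a minimal sketch, with hypothetical per-run average-FPS numbers standing in for real results:

```python
from math import prod

# Hypothetical average-FPS results from repeated runs of the
# benchmark; the AI plays slightly differently each time.
runs = [61.2, 58.9, 60.4, 59.7]

# The geometric mean damps the influence of a single outlier run
# more than the arithmetic mean would.
geomean = prod(runs) ** (1 / len(runs))
print(round(geomean, 2))
```

For averages this close together the geometric and arithmetic means barely differ, but the choice matters more when one run stalls badly.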
For automation, Gears Tactics falls under one of our more frustrating titles. The resolution and quality are taken from a settings file, and this is easily transferable between machines. However one part of the settings file cannot be transferred – a specific hardwareID created by the CPU+GPU combination. To get this, we load up the game first without a settings file so it generates the value, then we extract that value to use in our pre-prepared settings files.
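The extraction-and-graft step can be sketched as below. The field name `HardwareID` and the file layout are hypothetical stand-ins for whatever the game actually writes:

```python
import re

# Toy stand-in for the Gears Tactics settings file: the machine-specific
# ID (field name assumed here) is generated per CPU+GPU combination and
# cannot be copied between test beds.
generated = 'Resolution=1280x720\nHardwareID=ABC123-XYZ\nQuality=Low\n'
prepared = 'Resolution=3840x2160\nHardwareID=PLACEHOLDER\nQuality=Low\n'

# Pull the machine-specific ID from the game-generated file...
hw_id = re.search(r'HardwareID=(\S+)', generated).group(1)

# ...and graft it into each pre-prepared settings file for this system.
patched = prepared.replace('PLACEHOLDER', hw_id)
print(patched)
```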
There is no command line option to get to the benchmark, and so we load up the game and have to navigate to the benchmark mode. While I was testing for this article, the game introduced an intro splash screen notifying users of an update, which meant I had to script a way to cancel that screen without exiting the game – luckily there was a small segment of the ‘accept’ box that doesn’t cover the ‘exit game’ box when the intro splash screen disappears (so it should be valid when that splash screen is removed, or replaced).
By far the biggest issue with the benchmark is the results output. The game showcases a mountain of data when the benchmark is finished, such as how much the benchmark was CPU limited and where; however, none of that is ever exported into a file we can use. It’s just a screenshot which we have to read manually. If the developers could output all this data that they’ve clearly recorded and shown to the user, it would make my job a lot easier. I did consider putting the benchmark in a FRAPS wrapper; however, the loading time for the test is too variable, and the lack of a fixed seed for the AI actions makes the benchmark itself a bit variable in length.
As with the other benchmarks, we run each resolution/setting combination repeatedly until 10 minutes have passed. For this benchmark, we manually read each of the screenshots for each quality/setting/run combination. The benchmark does also give 95th percentiles and frame averages, so we can use both of these data points.
All of our benchmark results can also be found in our benchmark engine, Bench.
The highly anticipated iteration of the Grand Theft Auto franchise hit the shelves on April 14th 2015, with both AMD and NVIDIA helping to optimize the title. At this point GTA V is super old, but still super useful as a benchmark – it is a complicated test with many features that modern titles today still struggle with. With rumors of a GTA 6 on the horizon, I hope Rockstar makes that benchmark as easy to use as this one.
GTA doesn’t provide graphical presets, but opens up the options to users and extends the boundaries by pushing even the hardest systems to the limit using Rockstar’s Advanced Game Engine under DirectX 11. Whether the user is flying high in the mountains with long draw distances or dealing with assorted trash in the city, when cranked up to maximum it creates stunning visuals but hard work for both the CPU and the GPU.
We are using the following settings:
- 720p Low
- 1440p Low
- 4K Low
- 1080p Max
For our test we have scripted a version of the in-game benchmark. The in-game benchmark consists of five scenarios: four short panning shots with varying lighting and weather effects, and a fifth action sequence that lasts around 90 seconds. We use only the final part of the benchmark, which combines a flight scene in a jet with an inner-city drive-by through several intersections, and then ramming a tanker that explodes, causing other cars to explode as well. This is a mix of distance rendering followed by a detailed near-rendering action sequence, and the title thankfully spits out frame time data. The benchmark can also be called from the command line, making it very easy to use.
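With per-frame times available, the headline figures fall out directly. A minimal sketch with hypothetical frame-time data; real runs produce thousands of frames:

```python
# Hypothetical frame times (milliseconds) from a benchmark dump.
frame_times_ms = [16.7, 18.1, 15.9, 33.4, 16.2, 17.0, 16.5, 25.1]

# Average FPS is total frames over total time, not the mean of
# per-frame FPS values.
avg_fps = 1000 * len(frame_times_ms) / sum(frame_times_ms)

# The 95th-percentile frame time maps to the '5% low' FPS figure.
slowest = sorted(frame_times_ms)[int(0.95 * len(frame_times_ms))]
p95_fps = 1000 / slowest
print(round(avg_fps, 1), round(p95_fps, 1))
```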
There is one funny caveat with GTA. If the CPU is too slow, or has too few cores, the benchmark loads but doesn’t have enough time to put items in the correct position. As a result, when running our single-core Sandy Bridge system for example, the jet ends up stuck in the middle of an intersection causing a traffic jam. Unfortunately this means the benchmark never ends, but it is still amusing.
All of our benchmark results can also be found in our benchmark engine, Bench.
It’s great to have another Rockstar benchmark in the mix, and the launch of Red Dead Redemption 2 (RDR2) on the PC gives us a chance to do that. Building on the success of the original RDR, the second incarnation came to Steam in December 2019 having been released on consoles first. The PC version takes the open-world cowboy genre into the start of the modern age, with a wide array of impressive graphics and features that are eerily close to reality.
For RDR2, Rockstar kept the same benchmark philosophy as with Grand Theft Auto V, with the benchmark consisting of several cut scenes with different weather and lighting effects, and a final scene focusing on an on-rails environment – only this time a shop robbery leads to a shootout on horseback before riding over a bridge into the great unknown. Luckily most of the command line options from GTA V are present here, and the game also supports resolution scaling. We have the following tests:
- 384p Minimum
- 1440p Minimum
- 8K Minimum
- 1080p Max
For that 8K setting, I originally thought I had the settings file at 4K and 1.0x scaling, but it was actually set at 2.0x giving that 8K. For the sake of it, I decided to keep the 8K settings.
For automation, despite RDR2 taking a lot of inspiration from GTA V in its command line options and benchmark, the only feature it didn’t take was the actual flag that runs the benchmark. As a result, we have to use key presses on loading into the game in order to run the benchmark and get the data. It’s also worth noting that the benchmark results file is only dumped after the game has quit, which can cause issues in scripting when dealing with pauses (slow CPUs take a long time to load the test). The settings file accepts our pre-prepared versions along with the command line for ignoring new hardware, and the output files when you get them have all the frame times as required.
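Because the results file only appears after the game process exits, our scripts have to poll for it rather than read it straight after the run. A minimal sketch of that pattern; the file name, timeout, and polling interval are assumptions:

```python
import time
from pathlib import Path

def wait_for_results(path, timeout_s=600, poll_s=5):
    """Poll for a benchmark results file that is only written after the
    game has fully exited. The generous timeout accounts for slow CPUs
    taking several minutes to load into the test."""
    deadline = time.monotonic() + timeout_s
    path = Path(path)
    while time.monotonic() < deadline:
        # A zero-byte file means the dump is still in progress.
        if path.exists() and path.stat().st_size > 0:
            return path
        time.sleep(poll_s)
    raise TimeoutError(f"no results file at {path} after {timeout_s}s")
```

In practice this sits between sending the key presses that start the benchmark and the parsing stage that pulls the frame times out of the dump.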
All of our benchmark results can also be found in our benchmark engine, Bench.
Strange Brigade is set in 1903 Egypt, and follows a story very similar to that of the Mummy film franchise. This particular third-person shooter is developed by Rebellion Developments, which is more widely known for games such as the Sniper Elite and Alien vs Predator series. The game follows the hunt for Seteki the Witch Queen, who has arisen once again, and the only ‘troop’ who can ultimately stop her. Gameplay is cooperative-centric, with a wide variety of levels and many puzzles which need solving by the British colonial Secret Service agents sent to put an end to her reign of barbarism and brutality.
The game supports both the DirectX 12 and Vulkan APIs, so we test on both, and houses its own built-in benchmark as an on-rails experience through the game. For quality, the game offers various options for customization, including textures, anti-aliasing, reflections and draw distance, and even allows users to enable or disable motion blur, ambient occlusion, and tessellation, among others. We test at the following settings:
- 720p Low
- 1440p Low
- 4K Low
- 1080p Ultra
The automation for Strange Brigade is one of the easiest in our suite – the settings and quality can be changed by pre-prepared .ini files, and the benchmark is called via the command line. The output includes all the frame time data.
All of our benchmark results can also be found in our benchmark engine, Bench.
Truth be told, the concept of a project to benchmark some 700-900 processors has been rattling around in my head for a few years. I actually wrote the first segment of this article way back in 2016. However, over the course of 2016 and 2017, building new testing suites took longer, priorities changed, and the project didn’t so much get shelved as get pushed down the order on a semi-permanent basis until there was an ideal opening. Those of you who have followed the site may have noticed my responsibilities increase over time, darting 200k miles a year around the world. It can be difficult to keep a large project buoyant without constant attention.
Between 2016 and today, we’ve still been churning through the tests on the hardware, and updating our benchmark database with as many chips as we can find, even if it wasn’t under a governed project. The most recent version of our CPU2019 Bench has 272 CPUs with data recorded on up to 246 benchmark data points for each, just to showcase perhaps what one person can do in a given year. However, making Bench a specific project wasn’t necessarily a primary target of the site. With the launch of our Bench2020 suite, with a wider variety of tests and analysis, we’re going to put this into action. That’s not to say I have more time than normal (I might have to propose what we can do about getting an intern), but with the recent pandemic keeping me on the ground, it does give a chance to take stock of what users are really after.
With #CPUOverload, the goal is to do more than before, and highlight the testing we do. This is why I’ve spent the best part of 25-30 pages talking about benchmark sustainability, usefulness, automation, and why every benchmark is relevant to some of our user base. Over the last decade of providing results online for free, one obvious change in the requests from our readers has been toward specific benchmarks that target their workloads, rather than generic ones loosely related to their field. That’s part of what this project is, combined with testing at scale.
Users also want to find their exact CPU, and compare it to an exact CPU potential upgrade – a different model, at least in today’s naming conventions, might have different features. So getting exactly what you want to compare is always going to be better – being able to see how your Intel Core i5-2380P in that Dell OEM system you have had for 7 years compares to a newer Ryzen 7 2700E or Xeon E-2274G is all part of what makes this project exciting. That essence of scale, and trying to test as many different CPU variants as possible, is going to be a vital part of this project.
Obviously the best place to start with a project like this is two-fold: popular processors and modern processors. These get the most attention, and so covering the key parts from Coffee Lake, Kaby Lake, Ryzen and HEDT is going to be high on our list to start. Hardware that we’re testing for review also gets priority, which is why you might start seeing some Zhaoxin or Xeon/EPYC data enter Bench very soon. One funny element is that if you were to start listing what might be ‘high importance processors’, it very easily comes back with a list of between 25-100 SKUs, with various i9/i7/i5/i3 and R7/R5/R3/APU as well as Intel/AMD HEDT and halo parts in there – that’s already 10 segments! Some users might want us to focus on the cheap Xeon parts coming out of China too. Whatever our users want to see tested, we want to hear about it.
As part of this project, we are also expecting to look at some retrospective performance. Future articles might include ‘how well does Ivy Bridge i5 perform today’, or given AMD and Intel’s tendency to compare five year products to each other, we are looking to do that too, in both short and longer form articles.
When I first approached AMD and Intel’s consumer processor divisions about this project, wondering how much interest there would be for it, both came back to me with positive responses. They filled in a few of my hardware gaps, but cautioned that even as internal PR teams, they won’t have access to most chips, especially the older ones. This means that as we process through the hardware, we might start reaching out to other partners in order to fill in the gaps.
Is testing 900 CPUs ultimately realistic? Based on the hardware I have today, if I had access to Narnia, I could provide data for about 350 of the CPUs. In reality, with our new suite, each CPU takes 20-30 hours to test on the CPU benchmarks, and another 10 hours for the gaming tests. Going for 50-100 CPUs/month might be a tough ask, but let’s see how we get on. We have these dozen or so CPUs in the graphs here to start.
Of course, comments are always welcome. If there’s a CPU, old or new, you want to see tested, then please drop a comment below. It will help me decide which test beds get priority.