Alongside today’s extremely large and comprehensive CPU line-up announcement, which includes the new Cortex-X2, Cortex-A710, Cortex-A510, the new DSU-110 and new interconnects, we’re seeing the announcement of Arm’s newest Mali GPU line-up. As with the CPU family, this is an extensive line-up, with the new Mali-G710 flagship series, the mid-range G510, and the new ultra-area-efficient Mali-G310.
The new GPU series continues the Valhall GPU family that started back in 2019 with the Mali-G77, saw minor improvements with the Mali-G78 in last year’s announcements, and has seen silicon adoption in this year’s SoCs such as the Kirin 9000, Exynos 2100 or the new MediaTek Dimensity SoCs.
At the high-end, the Mali-G710 is a direct successor to the Mali-G78 and is a relatively straightforward generational improvement in terms of what it’s aiming for: the highest possible performance that Arm’s architects can achieve in a Mali GPU. The Mali-G610 is a branding exercise that offers the same microarchitecture as the G710 at lower core counts, aiming to help partners better separate flagship products from the “premium” segment.
The Mali-G510 is a successor to the 2019 Mali-G57 and is a major upgrade to Arm’s mid-range portfolio, bringing extremely large generational performance boosts as well as power efficiency gains over the predecessor.
Finally, the Mali-G310 is a new Valhall-based low-end entry that represents a multi-generation architectural bump over the long-in-the-tooth Bifrost-based Mali-G31. It targets the low-end, area-efficiency-focused market, where we see hundreds of millions of low-cost devices, as well as other embedded markets such as smart TVs.
As a broad overview, the highlight today for most readers will be the new Mali-G710 flagship GPU. The improvements the company is promising are roughly a +20% boost in performance in an ISO-process node GPU configuration compared to a comparable Mali-G78 GPU. Likewise, at similar performance, the new GPU design promises a 20% reduction in power consumption, and thus also an energy efficiency gain.
Recently, Arm has also put a focus on machine learning on the GPU, and here the new design is promising a larger +35% boost in performance.
As a continuation of the Valhall GPU architecture, the cornerstone characteristics of the new G710’s execution engines are roughly the same as what we’ve covered in the past-generation Mali-G77 and Mali-G78.
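To make the relationship between the two headline figures concrete, here is a minimal sketch of the perf-per-watt arithmetic. It assumes the +20% performance claim is measured at equal power and the 20% power reduction at equal performance; Arm's exact measurement conditions are not stated here, and the function name is purely illustrative.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative perf/W versus the baseline GPU (1.0 means no change)."""
    return perf_ratio / power_ratio


# Assumption: +20% performance at the same power as the Mali-G78 baseline.
iso_power_gain = perf_per_watt_gain(perf_ratio=1.20, power_ratio=1.00)

# Assumption: same performance as the baseline at 20% lower power.
iso_perf_gain = perf_per_watt_gain(perf_ratio=1.00, power_ratio=0.80)

print(f"iso-power: {iso_power_gain:.2f}x perf/W")
print(f"iso-perf:  {iso_perf_gain:.2f}x perf/W")
```

Under these assumptions the two claims describe different operating points: a 20% power cut at equal performance works out to a 1.25x perf/W ratio, slightly more than the 1.20x implied by the performance figure.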
Amongst the larger changes we saw with Valhall was the shift from a wavefront/warp size of 8 towards 16, with dual datapaths (clusters) per execution engine, resulting in a 32 FMA/core design that we saw in the G77 and G78.
The ISA is also said to have seen larger improvements, having been designed with modern APIs such as Vulkan in mind, although it’s always quite hard to quantify the impact such changes have on the overall performance and efficiency of a GPU.
What’s new in the Mali-G710 is the addition of a second execution engine, effectively doubling the compute performance per shader core of the Valhall architecture. In a sense, Arm is re-adopting a scaling approach that we had seen in past-generation Mali architectures; the Mali-G76, for example, had three execution engines per shader core.
In the above slide, the “8x” and “4x” metrics refer to throughput per cycle per core, and we can see that other functional blocks of the GPU have also doubled in throughput to keep up with the doubled compute throughput of the execution engines.
The new G710 includes a brand-new texture unit that is now able to handle up to 8 bilinear texels per clock, and Arm has generally optimised the new design to be significantly more area efficient, giving the new TMU a +50% performance density advantage.
Within the execution engine, Arm continues to employ two processing units, or clusters of processing elements, and in that regard we don’t see much difference between the generations. If we look deeper into the actual processing unit, however, there are changes to the blocks:
In the simplest terms, what we’re seeing is a shift from a single instance of 16-wide (warp-wide) processing elements and execution units to four instances of 4-wide execution units. The throughput between the designs doesn’t change, but the new microarchitecture gives more dedicated resources to the processing elements and allows for better structuring and thus better efficiency.
Overall, the new execution engine design doubles the FMAs per clock per core, which is somewhat obvious, but also has the benefit of lowering the energy distribution within the shader core from the execution engine by 20%.
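The per-core FMA figures follow directly from the counts given above: 16 lanes per cluster, two clusters per execution engine, and one or two execution engines per core. A minimal sketch of that arithmetic (the function and parameter names are illustrative, not Arm terminology):

```python
def fma_per_core(execution_engines: int,
                 clusters_per_engine: int = 2,
                 lanes_per_cluster: int = 16) -> int:
    """FMA operations per clock for one shader core."""
    return execution_engines * clusters_per_engine * lanes_per_cluster


g78_fma = fma_per_core(execution_engines=1)   # Mali-G77 / G78 style core
g710_fma = fma_per_core(execution_engines=2)  # Mali-G710 style core

# Four 4-wide units provide the same lane count as one 16-wide unit,
# which is why per-cluster throughput is unchanged between the designs.
old_cluster_lanes = 1 * 16
new_cluster_lanes = 4 * 4

print(g78_fma, g710_fma, old_cluster_lanes == new_cluster_lanes)
```

This gives 32 FMA/clock for the older single-engine core and 64 FMA/clock for the G710 core.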
A further very large highlight of the G710 is the replacement of the traditional “Job Manager” with the new “Command Stream Frontend”, which handles the scheduling and processing of draw calls. The CSF introduces a new CPU of undisclosed nature, and for the first time also introduces a firmware layer to Mali GPUs.
The goal of the design is to achieve more flexible and scalable performance for more complex graphical workloads, while at the same time improving system CPU power efficiency by reducing driver overhead through a very lightweight submission path. It simplifies support for API features such as state inheritance and secondary buffers, and helps with timing-sensitive applications such as VR or time-warp. Synchronisation events also benefit greatly from the move closer to the hardware and the reduction in latency that this enables.
The firmware is closely coupled to the hardware. It handles requests from the host and command buffer completion notifications, reduces the overhead of things such as protected-mode entry and exit, and even allows for emulation of API features that don’t yet exist in the hardware through additional instructions.
The new hardware has been redesigned from the ground up to keep up with modern content and to sustain the throughput of job submission into the other GPU units. Arm here claims that the new CSF allows for up to 5 million draw calls per second.
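To put the 5 million figure in context, it can be converted into a per-frame draw call budget. This is a rough sketch that treats Arm’s claim as a hard upper bound sustained evenly across frames, which real workloads are unlikely to achieve; the frame rates are example values chosen here, not figures from Arm.

```python
CSF_DRAWCALLS_PER_SECOND = 5_000_000  # Arm's claimed peak for the new CSF


def drawcalls_per_frame(fps: int) -> int:
    """Upper bound on draw calls per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return CSF_DRAWCALLS_PER_SECOND // fps


for fps in (30, 60, 120):  # example frame rates, assumed for illustration
    print(f"{fps:>3} fps: up to {drawcalls_per_frame(fps):,} draw calls/frame")
```

At 60 fps this works out to a ceiling of roughly 83 thousand draw calls per frame on the submission side, before any shading or bandwidth limits come into play.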
Overall, the new G710 microarchitecture seems very interesting, and in particular seems to want to address some API-overhead-related weaknesses of Arm’s Mali GPUs. How this plays out remains to be seen, but with advertised performance and power efficiency gains of 20% this generation, it seems like a solid improvement, although these figures wouldn’t be quite sufficient to alter the competitive landscape in the mobile market.
The Mali-G610 uses the same microarchitecture as the G710, only under a different name for configurations with fewer than 7 cores.
In the mid-range and low-end, the new Mali-G510 and Mali-G310 are generational improvements over their market predecessors, the G57 and G31. Representing major jumps in microarchitecture, these new designs bring unusually large performance gains to Arm’s mid-range and low-end offerings.
From a very high-level view, the G510 scales from 2 to 6 cores, but also offers differentiation by changing the number of clusters within one of the execution engines per core, or by changing the type of texture unit in use, either a 4x throughput unit or an 8x unit.
On the execution engine side, we always have two execution engines, but it’s possible to configure down one of them to only contain a single cluster, effectively reducing the compute part of the core from a 64 FMA/cycle design to a 48 FMA/cycle design. The reason for such granularity is that the usual customers for such GPUs have hyper-optimised use-cases and will configure their GPU implementation for a specific use-case and criteria, and only use the bare minimum configuration to fulfil those demands, in the smallest possible area.
Arm here puts an emphasis on the 10 different configuration options of the G510 IP, all having different compute or fill-rate optimised performance points. The need for such configurability might be somewhat unintuitive for the everyday reader, but there are non-mobile markets that really care about every fraction of a mm² when it comes to implementations.
Scaling further below the G510 is the new G310. This GPU is actually a major performance leap compared to the previous generation smallest Mali IP offering, the G31, as we’re seeing the move from a Bifrost architecture to the new Valhall design.
Here, we’re seeing adoption of the new execution engine design, with the option to scale the clusters down to only one per EE, and to use only one EE in the minimum configuration, allowing for 16, 32, 48 or 64 FMAs per shader core. The texture unit also scales down to 2 texels/cycle at minimum, and the varying unit is likewise scaled down compared to its bigger siblings.
The G310 is exclusively a single-shader-core design, so the different configurations are achieved solely by changing the execution units within that core. Unfortunately, Arm doesn’t seem to plan any public naming scheme for the various configurations, so it will be up to the vendors to actually do any kind of disclosure.
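The four compute options fall out of combining one or two execution engines with one or two 16-lane clusters each. The sketch below simply enumerates every such combination; which exact combinations Arm offers to licensees is not disclosed, so treat the enumeration as an assumption. The helper name is hypothetical.

```python
from itertools import combinations_with_replacement

LANES_PER_CLUSTER = 16  # FMA lanes per cluster, as on other Valhall cores


def fma_options(max_engines: int = 2, max_clusters: int = 2) -> list[int]:
    """All distinct FMA/clock totals for a single shader core."""
    totals = set()
    for engines in range(1, max_engines + 1):
        # Each engine independently holds 1..max_clusters clusters.
        cluster_counts = range(1, max_clusters + 1)
        for layout in combinations_with_replacement(cluster_counts, engines):
            totals.add(sum(layout) * LANES_PER_CLUSTER)
    return sorted(totals)


print(fma_options())
```

The 48 FMA point corresponds to two engines with one of them cut down to a single cluster, the same arrangement described for the reduced G510 configuration.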
Coming into Arm’s GPU briefing, the company’s engineers had first noted that they’re expecting a more significant boost in performance and efficiency for this generation, and had admitted that the G78 last year wasn’t as large a leap due to timing constraints.
In general, while the G710 is offering some very solid figures at +20% performance and efficiency, given the context of Arm’s current Mali GPU performance in chipsets such as Samsung’s Exynos, HiSilicon’s Kirin or MediaTek’s Dimensity, that’s still not a very large leap in terms of competitive positioning at the very high-end of the market, which admittedly is what garners the most media attention and is generally what we are most enthusiastic about.
Lacking a larger shift or step-function upgrade, Arm’s high-end prospects in the coming year don’t look very great: with HiSilicon effectively cut off from the rest of the semiconductor ecosystem, and with Samsung having confirmed its shift to AMD RDNA GPUs in the next generation of Exynos SoCs, this leaves MediaTek as the last “big” mobile Mali licensee. MediaTek is actually doing very well and picking up market share in the gap that HiSilicon has left behind, but it has never truly gone for the flagship SoC segment or for big GPU designs. It may very well be that we simply won’t see a high-end Mali-G710 implementation.
Arm’s Mali GPU design philosophy has always been a double-edged sword, particularly because the company is trying to cater to such a wide market with a very similar microarchitecture. While the high-end looks quite bleak because of this, the mid-range and now the low-end look extremely promising.
The new Mali-G510 and G310 GPUs are extremely large generational upgrades with quite massive performance improvements. Arm stated that it shipped over a billion Mali GPUs in 2020, with metrics such as 80% market share in DTV and 50% market share in smartphones. If those DTV market share numbers are accurate, it also means that Arm has managed to gain a significant amount of market share from Imagination, the only other surviving GPU IP vendor. The new, more comprehensive mid-range and low-end series are poised to maintain and further strengthen those gains in those markets.