Just came across a very good blog post by NVIDIA on work graphs from 2 years ago:
GPU-driven rendering has long been a major goal for many game applications. It enables better scalability for handling large virtual scenes and reduces cases where the CPU could bottleneck a game’s…
Work graphs new functionality
D3D12 already exposes functionality to aid in GPU-driven rendering, as mentioned previously. This section highlights the new functionality introduced by work graphs, compared to existing functionality.
Dynamic shader selection
Each node in the work graph can choose which of its children to run. The decision is driven by the producer’s shader code itself. This enables decisions to be determined by information generated by the GPU in a previous node or workload.
On the other hand, ExecuteIndirect is confined to the state it was launched with, most notably the shader specified by the pipeline state object. An application that needs to launch different shaders depending on GPU-side data has no choice but to issue a series of SetPipelineState and ExecuteIndirect calls, or to rely on inefficient uber shaders that cover only some of the possibilities.
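To make the producer-side decision concrete, here is a minimal CPU-side C++ sketch. It is not D3D12 or HLSL code: `ChildNode`, `Record`, `classify`, and `countPerChild` are hypothetical names invented for illustration, and the alpha and visibility inputs stand in for whatever data an earlier GPU workload would have generated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model only: none of these names belong to D3D12 or HLSL.
enum class ChildNode : std::uint8_t { Opaque = 0, Transparent = 1, Culled = 2 };

struct Record {
    ChildNode target;        // child node chosen by the producer
    std::uint32_t objectId;  // payload handed to that child
};

// Stand-in for the producer node's shader: it inspects per-object data
// and decides which child should process the object.
inline Record classify(std::uint32_t objectId, float alpha, bool visible) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    if (!visible) return {ChildNode::Culled, objectId};
    return {alpha < 1.0f ? ChildNode::Transparent : ChildNode::Opaque, objectId};
}

// Counts how many records each of the three children receives.
inline std::vector<std::uint32_t> countPerChild(const std::vector<Record>& records) {
    std::vector<std::uint32_t> counts(3, 0);
    for (const Record& r : records) ++counts[static_cast<std::size_t>(r.target)];
    return counts;
}
```

The point of the sketch is that the choice of child is made per record inside the producer, with no CPU round trip between the decision and the launch of the chosen shader.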
Implicit micro-dependency model
Rendering a frame involves executing several major passes, such as depth, geometry, or lighting passes. Within each pass, data is processed in parallel, where each unit of data goes through several sequential operations. Resource barriers are usually placed between the operations to ensure data processing is completed by the previous operation before moving to the next.
A work graph expresses this dependency implicitly: producer nodes pass records to child nodes. A child node’s shader runs only after the producer has finished writing the record, which implies that the data is fully ready for the child to consume. Note that a work graph producer-consumer dependency is scoped to an individual data record, whereas a resource barrier applies to all accesses to a resource.
The work graph dependency model is fine-grained compared to barriers. This can translate to better occupancy on the GPU, as dependent work can launch earlier instead of waiting for a barrier to finish. Records can immediately pass from the producer to the consumer node and need not be fully flushed across algorithm steps as is the case for Dispatch-ResourceBarrier sequences.
Figure 2 illustrates how the workloads are executed in each case. On the left, two Dispatch calls are separated by a ResourceBarrier. Each row represents a producer thread-group (green) and its consumer thread-group (blue). On the right, the same workloads run with a work graph.

Figure 2. A comparison of workload execution
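The two timelines in Figure 2 can be approximated with a toy timing model. The C++ sketch below is an idealization, not a measurement: it assumes every thread-group can run as soon as its input is ready (unlimited parallelism), and the function names `finishWithBarrier` and `finishWithRecords` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dispatch / ResourceBarrier / Dispatch: no consumer starts until the
// slowest producer has finished, so the two maxima add up.
inline int finishWithBarrier(const std::vector<int>& produce,
                             const std::vector<int>& consume) {
    assert(!produce.empty() && produce.size() == consume.size());
    const int barrier = *std::max_element(produce.begin(), produce.end());
    const int longest = *std::max_element(consume.begin(), consume.end());
    return barrier + longest;
}

// Record-level dependency: consumer i starts as soon as producer i has
// written its record, so each row finishes independently of the others.
inline int finishWithRecords(const std::vector<int>& produce,
                             const std::vector<int>& consume) {
    assert(!produce.empty() && produce.size() == consume.size());
    int finish = 0;
    for (std::size_t i = 0; i < produce.size(); ++i)
        finish = std::max(finish, produce[i] + consume[i]);
    return finish;
}
```

With producer times {1, 4, 2} and consumer times {3, 1, 2}, the barrier version finishes at time 7 and the record-level version at time 5. Under this model the record-level finish time can never exceed the barrier one, since each row's sum is bounded by the sum of the two maxima.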
Work graphs overview
Shader Model 6.8 for D3D12, among many other features, marks the official release of work graphs. The ‘graph’ in the name matches the usual definition: a collection of nodes connected by edges. In work graphs, nodes perform tasks (“work”) and pass data to other nodes across the graph edges.
But what is this work that a node executes? Is it a command such as a Dispatch call? A single thread running a certain shader? Or perhaps a group of threads running the same shader?
The answer is, all of the above. Each node has a shader that is launched in a certain configuration of the programmer’s choice. This configuration, or launch mode, can be a full dispatch grid (broadcast launch), or compute threads that run either independently of each other (thread launch) or potentially collectively (coalescing launch). Note that thread launch work can be gathered to run in a wave where possible, but each thread still has inputs independent of the other threads.
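As a rough way to compare the three launch modes, the following C++ sketch counts how many threads a batch of input records would launch under each one. It is plain arithmetic rather than D3D12 code, and the names (`Grid`, `broadcastThreads`, `threadLaunchThreads`, `coalescingThreads`) are hypothetical. For coalescing launch it assumes the best case, where every thread-group is filled with as many records as it is allowed to take; a real implementation is free to launch groups with fewer records.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: size of a dispatch grid in thread-groups.
struct Grid { std::uint32_t x, y, z; };

// Broadcast launch: every record launches a full grid of thread-groups.
inline std::uint64_t broadcastThreads(std::uint64_t records, Grid grid,
                                      std::uint32_t threadsPerGroup) {
    return records * grid.x * grid.y * grid.z * threadsPerGroup;
}

// Thread launch: one thread per record.
inline std::uint64_t threadLaunchThreads(std::uint64_t records) {
    return records;
}

// Coalescing launch, best case: each thread-group consumes up to
// maxRecordsPerGroup records, so the group count is a rounded-up division.
inline std::uint64_t coalescingThreads(std::uint64_t records,
                                       std::uint32_t maxRecordsPerGroup,
                                       std::uint32_t threadsPerGroup) {
    assert(maxRecordsPerGroup > 0);
    const std::uint64_t groups =
        (records + maxRecordsPerGroup - 1) / maxRecordsPerGroup;
    return groups * threadsPerGroup;
}
```

For example, a single record sent to a broadcast node with a 2x2x1 grid of 64-thread groups launches 256 threads, while the same record sent to a thread launch node launches exactly one.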
A connection to another node is realized by choosing the target node and passing data to it. This resembles what is typically known as continuation in graph terminology. The target node receives the data and runs outside its caller’s context. There is no call stack in this system, just data cascading from the top to the bottom of the graph.
Units of data, called records, drive the entire execution of the work graph. To launch a node, a record must be written for it. The node’s shader is then launched in the chosen launch mode, and consumes that record as input. The record is a packed structure of data filled by the producer. The producer could be the DispatchGraph command issued from the CPU, or any node in the work graph. A node consuming the record can be thought of as a child of the producer node.
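The record-driven flow described above can be mimicked on the CPU with a simple queue. The C++ sketch below is only an analogy: `Record`, `NodeShader`, `runGraph`, and `makeExampleGraph` are hypothetical names, nodes run one at a time rather than in parallel, and the initial record plays the role of the input that DispatchGraph would supply. What it does preserve is the key property: a node runs only because a record was written for it, and nothing ever returns to the producer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical model: a record names its target node and carries a payload.
struct Record { std::size_t node; int payload; };
using Outputs = std::vector<Record>;
using NodeShader = std::function<void(int payload, Outputs& out)>;

// Drains the pending records; returns how many times each node was launched.
inline std::vector<int> runGraph(const std::vector<NodeShader>& nodes, Record entry) {
    std::vector<int> launches(nodes.size(), 0);
    std::deque<Record> pending{entry};
    while (!pending.empty()) {
        const Record r = pending.front();
        pending.pop_front();
        assert(r.node < nodes.size());
        ++launches[r.node];
        Outputs out;                    // records written by this launch
        nodes[r.node](r.payload, out);  // no return value, no call stack
        pending.insert(pending.end(), out.begin(), out.end());
    }
    return launches;
}

// Example graph: node 0 fans out to node 1, which forwards even payloads to node 2.
inline std::vector<NodeShader> makeExampleGraph() {
    return {
        [](int payload, Outputs& out) {
            for (int i = 0; i < payload; ++i) out.push_back({1, i});
        },
        [](int payload, Outputs& out) {
            if (payload % 2 == 0) out.push_back({2, payload});
        },
        [](int, Outputs&) { /* leaf node: consumes its record, emits nothing */ },
    };
}
```

Calling `runGraph(makeExampleGraph(), Record{0, 4})` launches the root once, node 1 four times, and node 2 twice, since only payloads 0 and 2 are forwarded.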

Figure 1. A work graph with a root node producing records to three children. Each child’s launch mode is different, resulting in different total threads launched for each node per single input record