DirectX12 – Multithread Architecture: A First Approach

In this post, I want to show my progress with DirectX12 by describing the base of my current architecture.

My biggest personal project was a mini graphics engine, designed from scratch with DirectX11, that supported deferred shading and PBR. That project did not take advantage of multiple threads because it ran its tasks sequentially:

  • Load resources (textures, models, buffers creation, etc.)
  • In the main loop:
    • Update camera based on user input
    • Clear render targets and depth stencil
    • For each geometry:
      • Execute drawing tasks (deferred rendering geometry and lighting phases, post processing)
      • Present

When DirectX12 was released, I knew that many changes were coming, not only to the API but to the whole paradigm. From that moment, I decided to take advantage of multiple threads and to design a new application, also from scratch, around multithreading.


I decided to implement a task-based architecture for parallel draw submission. I considered having a “Master Render Task” for work submission with a couple of “Worker Tasks” for command list recording and resource creation. The idea is that the “Master Render Task” spawns these “Worker Tasks” that record command lists that are executed by a “Command List Processor Task.”

In the following images, you can see the big picture of this architecture, step by step. Each blue box runs in a different thread. Each red box is an Intel TBB task that is executed to do something (initialization or command list construction) and then terminates. Each violet box is simply an instance of a system class (Camera, ShadersManager, etc.)

Now, I am going to describe the different steps involved:

  • App class is instantiated


  • Master Render Task is spawned


  • App constructs InitTasks and passes them to Master Render Task, which is going to build command builder tasks.


  • After that, App continues working in its own thread (processing events, etc.)


  • The Master Render Task runs in its own thread too and continually spawns command builder tasks, which record command lists (executed by the Command List Processor task)


  • The Command List Processor task, which also runs in its own thread, consumes command lists and executes them (i.e., sends them to the GPU command queue)


Now, I am going to describe these different boxes in more detail.


I have several managers for different purposes:

  • CommandManager for ID3D12CommandQueue, ID3D12GraphicsCommandList, and ID3D12CommandAllocator
  • PSOManager for ID3D12PipelineState
  • ResourceManager for ID3D12Resource, ID3D12DescriptorHeap, and ID3D12Fence
  • RootSignatureManager for ID3D12RootSignature
  • ShaderManager for ID3DBlob (shader and root signature HLSL files)

Each of them has methods to create the desired resource and return a unique id, which is later used to get or erase it. All of them are thread safe because they use Intel TBB concurrent hash maps plus an atomic integer variable to generate the unique ids. Here is the PSOManager implementation (the remaining managers are similar, and you can find them in the repository linked at the end of this post) and the unique id generation method (which, as you can see, is trivial but useful)
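
As a simplified, portable stand-in for that listing (not the actual PSOManager code), the sketch below shows the same idea: an atomic counter hands out unique ids, and a thread-safe map stores the objects by id. The names NextUniqueId, FakePipelineState, and SimplePSOManager are hypothetical, and a mutex-guarded std::unordered_map replaces the TBB concurrent hash map only so that the example builds without TBB or the Windows SDK.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical helper: returns a process-wide unique id (never 0).
inline std::size_t NextUniqueId() {
    static std::atomic<std::size_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}

// Stand-in for ID3D12PipelineState so the sketch builds without the Windows SDK.
struct FakePipelineState {
    std::string debugName;
};

// Assumption: a mutex-guarded std::unordered_map replaces the TBB concurrent
// hash map mentioned in the post, purely to keep the example dependency-free.
class SimplePSOManager {
public:
    // Stores a new object and returns the id used later to get or erase it.
    std::size_t Create(const std::string& debugName) {
        const std::size_t id = NextUniqueId();
        assert(id != 0);
        std::lock_guard<std::mutex> lock(mMutex);
        mById.emplace(id, FakePipelineState{debugName});
        return id;
    }

    // Returns the stored object, or nullptr if the id is unknown.
    // The pointer stays valid only until that id is erased.
    const FakePipelineState* Get(std::size_t id) const {
        std::lock_guard<std::mutex> lock(mMutex);
        const auto it = mById.find(id);
        return it == mById.end() ? nullptr : &it->second;
    }

    // Returns true if an object with that id existed and was removed.
    bool Erase(std::size_t id) {
        std::lock_guard<std::mutex> lock(mMutex);
        return mById.erase(id) == 1;
    }

private:
    mutable std::mutex mMutex;
    std::unordered_map<std::size_t, FakePipelineState> mById;
};
```

Because the id comes from a single atomic counter, two threads calling Create() at the same time can never receive the same id, which is the property the managers rely on.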




Command List Processor

It is a task responsible for extracting command lists from its concurrent queue and executing them. It tries to extract M command lists (3 in my case) before calling ID3D12CommandQueue::ExecuteCommandLists(), unless there are not enough of them in the queue. This reduces the number of ExecuteCommandLists() calls. It is an Intel TBB task that runs continuously and independently. Its main function looks like the following
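
As a simplified, portable stand-in for that listing (not the repository code), the sketch below shows one iteration of the batching logic: pop up to M command lists and submit them with a single call. SimpleCommandListQueue, FakeCommandList, and ProcessOnce are hypothetical names, and the call to ID3D12CommandQueue::ExecuteCommandLists() is only simulated by a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Stand-in for ID3D12CommandList*; an int keeps the sketch portable.
using FakeCommandList = int;

// Assumption: a mutex-guarded deque replaces the TBB concurrent queue.
class SimpleCommandListQueue {
public:
    void Push(FakeCommandList cmdList) {
        std::lock_guard<std::mutex> lock(mMutex);
        mPending.push_back(cmdList);
    }

    // Pops up to maxCount command lists; returns fewer (possibly none)
    // when the queue does not hold that many.
    std::vector<FakeCommandList> PopBatch(std::size_t maxCount) {
        std::vector<FakeCommandList> batch;
        std::lock_guard<std::mutex> lock(mMutex);
        while (batch.size() < maxCount && !mPending.empty()) {
            batch.push_back(mPending.front());
            mPending.pop_front();
        }
        return batch;
    }

private:
    std::mutex mMutex;
    std::deque<FakeCommandList> mPending;
};

// One iteration of the processor loop. Returns how many command lists were
// submitted; executeCalls counts the simulated ExecuteCommandLists() calls.
inline std::size_t ProcessOnce(SimpleCommandListQueue& queue,
                               std::size_t maxBatch,
                               std::size_t& executeCalls) {
    assert(maxBatch > 0);
    const std::vector<FakeCommandList> batch = queue.PopBatch(maxBatch);
    if (batch.empty()) {
        return 0;  // Nothing recorded yet; the real task would try again.
    }
    // Real code would call ExecuteCommandLists() once for the whole batch.
    ++executeCalls;
    return batch.size();
}
```

With 7 pending command lists and M = 3, this makes 3 submissions (3 + 3 + 1) instead of 7.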


Master Render Task

It is an Intel TBB task responsible for initializing the Direct3D systems and managers, and for constructing and spawning the command builder tasks. It also spawns the Command List Processor, which consumes the command lists recorded by the command builder tasks. Its main function looks like the following




App is the class responsible for initializing the mouse, keyboard, and camera, and the most important task: MasterRenderTask. Its main function looks like the following


Init Task

A class that, based on its input (InitTaskInput), initializes its output (CmdBuilderTaskInput) when it is executed by MasterRenderTask. InitTaskInput contains information such as shader locations, PSO settings, root signature, and geometry. CmdBuilderTaskInput contains pointers to data such as the PSO, root signature, command lists, and allocators.

The user should fill a list of InitTasks before initializing the scene. It looks like the following
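
As a rough stand-in for that listing (not the repository code), the sketch below shows the input-to-output relationship described above. All type and field names (SimpleInitTaskInput, SimpleCmdBuilderTaskInput, SimpleInitTask) are hypothetical, and plain integer ids replace the pointers to Direct3D objects so that the example stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical, trimmed-down input: plain data describing what to build.
struct SimpleInitTaskInput {
    std::string vertexShaderPath;
    std::string pixelShaderPath;
    std::string rootSignaturePath;
    std::size_t geometryCount = 0;
};

// Hypothetical output. The real one holds pointers to the PSO, root
// signature, command lists, and allocators; ids stand in for them here.
struct SimpleCmdBuilderTaskInput {
    std::size_t psoId = 0;
    std::size_t rootSignatureId = 0;
    std::size_t geometryCount = 0;
    bool initialized = false;
};

class SimpleInitTask {
public:
    explicit SimpleInitTask(SimpleInitTaskInput input) : mInput(std::move(input)) {}

    // In the post this step is run by MasterRenderTask. nextId simulates the
    // managers handing out ids for the objects they create.
    void Execute(SimpleCmdBuilderTaskInput& output, std::size_t& nextId) const {
        assert(!mInput.vertexShaderPath.empty());
        assert(!mInput.pixelShaderPath.empty());
        output.psoId = nextId++;            // would come from PSOManager
        output.rootSignatureId = nextId++;  // would come from RootSignatureManager
        output.geometryCount = mInput.geometryCount;
        output.initialized = true;
    }

private:
    SimpleInitTaskInput mInput;
};
```

The point of the split is that everything expensive (shader loading, PSO creation) happens once in the init step, so the command builder step only reads already-initialized data.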




Cmd Builder Task

A class that, based on its input (CmdBuilderTaskInput, previously initialized by an InitTask), records a command list and stores it in the concurrent command list queue provided by CommandListProcessor. It looks like the following
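
As a simplified stand-in for that listing (not the repository code), the sketch below records a list of command tags instead of real Direct3D commands and pushes the result into a shared sink. SimpleCmdBuilderTask, CommandListSink, and the Command enum are hypothetical; the enum values only borrow the names of the ID3D12GraphicsCommandList methods they represent.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Tags standing in for the real ID3D12GraphicsCommandList calls.
enum class Command {
    SetPipelineState,
    SetGraphicsRootDescriptorTable,
    IASetVertexBuffers,
    IASetIndexBuffer,
    DrawIndexedInstanced
};

using RecordedCommandList = std::vector<Command>;

// Assumption: a mutex-guarded vector replaces the concurrent queue owned by
// the command list processor.
class CommandListSink {
public:
    void Push(RecordedCommandList cmdList) {
        std::lock_guard<std::mutex> lock(mMutex);
        mLists.push_back(std::move(cmdList));
    }
    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mMutex);
        return mLists.size();
    }
    std::size_t CommandCount(std::size_t index) const {
        std::lock_guard<std::mutex> lock(mMutex);
        return mLists.at(index).size();
    }

private:
    mutable std::mutex mMutex;
    std::vector<RecordedCommandList> mLists;
};

class SimpleCmdBuilderTask {
public:
    SimpleCmdBuilderTask(std::size_t geometryCount, CommandListSink& sink)
        : mGeometryCount(geometryCount), mSink(sink) {}

    // Records one command list and hands it to the sink. Safe to call from
    // several tasks at once because only the sink is shared.
    void Execute() {
        RecordedCommandList cmdList;
        cmdList.push_back(Command::SetPipelineState);  // once per list
        for (std::size_t i = 0; i < mGeometryCount; ++i) {
            cmdList.push_back(Command::SetGraphicsRootDescriptorTable);
            cmdList.push_back(Command::IASetVertexBuffers);
            cmdList.push_back(Command::IASetIndexBuffer);
            cmdList.push_back(Command::DrawIndexedInstanced);
        }
        mSink.Push(std::move(cmdList));
    }

private:
    std::size_t mGeometryCount;
    CommandListSink& mSink;
};
```

Each task builds its command list in local state and touches shared state only in the final Push(), which is what lets many builders record in parallel.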




Swap Chain and Render Targets

For performance reasons, after we submit all command lists for the current frame, we should not wait until the GPU finishes before beginning the next frame (otherwise the GPU will be idle, and we should avoid that). Instead, we should have several queued frames to keep the GPU busy.

I read the following in NVIDIA's DirectX12 recommendations:

“Don’t forget that there’s a per swap-chain limit of 3 queued frames before DXGI will start to block in Present()”

I am doing something similar in my architecture. If N is the number of swap chain buffers, then I have N - 1 queued frames. In my case, N is equal to 4, but this parameter is easy to tweak. What happens if the CPU has already sent all command lists for all the queued frames, but the GPU has not finished executing the command lists for the first one? We should use fences to handle this situation. We do this in MasterRenderTask::SignalFenceAndPresent(), which is called at the end of the MasterRenderTask::ExecuteCmdBuilderTasks() method (previously shown in the MasterRenderTask section)
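
As a portable stand-in for that function (not the actual SignalFenceAndPresent() code), the sketch below simulates the bookkeeping: each queued frame slot remembers the fence value it was signaled with, and the CPU must wait only when the slot it is about to reuse has not been reached by the GPU yet. FakeFence and FrameSync are hypothetical names; in real code the signal would go through ID3D12CommandQueue::Signal() and the check through ID3D12Fence::GetCompletedValue().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSwapChainBufferCount = 4;
constexpr std::size_t kQueuedFrameCount = kSwapChainBufferCount - 1;

// Stand-in for ID3D12Fence: completedValue is the last value the GPU reached.
struct FakeFence {
    std::uint64_t completedValue = 0;
};

class FrameSync {
public:
    // Call once per frame after submitting its command lists.
    // Returns true if the CPU would have to wait before recording the next
    // frame, because the slot it is about to reuse is still in flight.
    bool SignalAndAdvance(const FakeFence& fence) {
        // Tag the frame that was just submitted with a new fence value.
        ++mLastSignaledValue;
        mFenceValueBySlot[mSlot] = mLastSignaledValue;

        // Move on to the slot the next frame will use.
        mSlot = (mSlot + 1) % kQueuedFrameCount;

        // Real code would block here, e.g. with SetEventOnCompletion() and
        // WaitForSingleObject(); the sketch only reports that it is needed.
        return fence.completedValue < mFenceValueBySlot[mSlot];
    }

private:
    std::array<std::uint64_t, kQueuedFrameCount> mFenceValueBySlot{};
    std::uint64_t mLastSignaledValue = 0;
    std::size_t mSlot = 0;
};
```

With 3 queued frames and a GPU that has completed nothing, the first two calls return false and the third returns true: the CPU got as far ahead as it is allowed to and must now wait for the first frame.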


and here you can see the swap chain creation method



For testing purposes, I am going to use a sphere as the base geometry. There will be 16000 spheres. I am going to record M command lists in CmdBuilderTasks, and each command list will draw N spheres, so N * M will always be equal to 16000.

Each command list will have approximately N * 4 commands (approximately, because I am ignoring the commands for render target setting, PSO, etc.). I wrote N * 4 because for each sphere I need to call SetGraphicsRootDescriptorTable(), IASetVertexBuffers(), IASetIndexBuffer(), and DrawIndexedInstanced(). This is not the best way to draw a large number of spheres in different places; the goal is only to build a scene for benchmarking.
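
To make the arithmetic concrete, here is a small sketch that computes N and the approximate number of commands per command list for a given M. The names (SplitWorkload, Workload) are hypothetical, and it assumes M divides 16000 evenly.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSphereCount = 16000;
constexpr std::size_t kCommandsPerSphere = 4;  // root table, VB, IB, draw

struct Workload {
    std::size_t spheresPerList;   // N
    std::size_t commandsPerList;  // approximately N * 4
};

// Splits the 16000 spheres across commandListCount (M) command lists.
// Assumes commandListCount is non-zero and divides kSphereCount evenly.
constexpr Workload SplitWorkload(std::size_t commandListCount) {
    return Workload{kSphereCount / commandListCount,
                    kSphereCount / commandListCount * kCommandsPerSphere};
}

// M = 4 gives N = 4000 spheres and 16000 commands per command list.
static_assert(SplitWorkload(4).spheresPerList == 4000, "N for M = 4");
static_assert(SplitWorkload(4).commandsPerList == 16000, "commands for M = 4");

// M = 16 gives N = 1000 spheres and 4000 commands per command list.
static_assert(SplitWorkload(16).spheresPerList == 1000, "N for M = 16");
static_assert(SplitWorkload(16).commandsPerList == 4000, "commands for M = 16");
```

The total stays at 64000 commands regardless of M; only how they are divided among command lists changes.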

I distributed the spheres randomly in the scene inside a rectangular area of 500 x 500 centered at the origin. It looks like the following


The hardware I have is:

  • Intel i7 6700K
  • 32 GB RAM Corsair Vengeance
  • 1 TB WD Caviar Black
  • Nvidia GTX 680

I am going to use a swap chain with 4 buffers and 3 queued frames. These are the results.


As you can see, if we ignore multithreading and record all the commands in a single command list, then we are only using a single core, and the GPU could also be idle while it waits for that single command list to be sent to its queue.

Also, we can see that having 4 command builder threads recording 16000 commands each, or 16 command builder threads recording 4000 commands each, gives similar results. The cause could be that the processor has 4 cores.

As I used Intel TBB, I needed to take into account parallel_for’s grain size parameter. It is important to understand it and to know how to use it. A good reference I used is the following
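
Independently of that reference, the sketch below is my own illustration of what the grain size controls: a range is halved recursively until each piece is no larger than the grain size, and each final piece becomes one unit of work. SplitRange and Range are hypothetical names, and the code does not use TBB. As far as I understand, this mirrors how a blocked range behaves with a simple partitioner, while TBB's default partitioner may stop splitting earlier and produce larger chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end).
struct Range {
    std::size_t begin;
    std::size_t end;
    std::size_t Size() const { return end - begin; }
};

// Recursively halves the range while a piece is larger than grainSize and
// appends the final pieces to chunks, in order.
inline void SplitRange(Range range, std::size_t grainSize, std::vector<Range>& chunks) {
    assert(grainSize > 0);
    if (range.Size() <= grainSize) {
        chunks.push_back(range);  // Small enough: one unit of work.
        return;
    }
    const std::size_t middle = range.begin + range.Size() / 2;
    SplitRange(Range{range.begin, middle}, grainSize, chunks);
    SplitRange(Range{middle, range.end}, grainSize, chunks);
}
```

For 16000 items, a grain size of 4000 yields 4 chunks of 4000 items and a grain size of 1000 yields 16 chunks of 1000 items. A grain size that is too small creates many tiny tasks whose scheduling overhead dominates, and one that is too large leaves cores without work.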


If you are interested in the code, you can find it at this link. It is continually being updated, but most of it should be the same. Feel free to use it as you want.


3 thoughts on “DirectX12 – Multithread Architecture: A First Approach”

  1. Hey Nicolas, great article.
    I’m wondering if you could do a follow up article showing how to profile this architecture and see how it behaves under heavy load. It could be useful to see thread utilization, latency, etc.


    1. Hi, Diego. I changed the architecture a little bit, so this First Approach is a little old. Once I finish implementing area lights, I plan to do a benchmark that tries to stress the system. Thanks for the recommendation!

