Multithreading Exercises #1 – Bouncing Particles

In this post, we are going to talk about an exercise that should be solved using multithreading techniques.

Scenario

particle.png

  • We have a particle system with N particles. These particles will be bouncing inside a box.
  • Each particle position is generated randomly within a predefined range (MIN_POS < x, y, z < MAX_POS).
  • Each particle velocity is generated randomly within a predefined range (MIN_VEL < x, y, z < MAX_VEL)
  • Each particle is represented by a vertex and drawn using DirectX11. Each vertex in vertex buffer has only its position, and it is expanded to a camera facing quad in the geometry shader.

#define QUAD_VERTICES (4)

struct Input {
    float4 PosVS : POSITION_VIEW;
};

cbuffer cbPerFrame : register (b0) {
    float4x4 Projection;
    float QuadHalfSize;
};

struct Output {
    float4 PosH : SV_POSITION;
    float2 TexCoord : TEXCOORD;
};

[maxvertexcount(QUAD_VERTICES)]
void
main(const in point Input input[1], inout TriangleStream<Output> triangleStream) {
    // Compute vertices positions and texture coordinates based on
    // a quad whose center position is mQuadCenterPosV
    Output output;

    output.PosH = mul(input[0].PosVS + float4(-QuadHalfSize, QuadHalfSize, 0.0f, 0.0f), Projection);
    output.TexCoord = float2(0.0f, 1.0f);
    triangleStream.Append(output);

    output.PosH = mul(input[0].PosVS + float4(QuadHalfSize, QuadHalfSize, 0.0f, 0.0f), Projection);
    output.TexCoord = float2(1.0f, 1.0f);
    triangleStream.Append(output);

    output.PosH = mul(input[0].PosVS + float4(-QuadHalfSize, -QuadHalfSize, 0.0f, 0.0f), Projection);
    output.TexCoord = float2(0.0f, 0.0f);
    triangleStream.Append(output);

    output.PosH = mul(input[0].PosVS + float4(QuadHalfSize, -QuadHalfSize, 0.0f, 0.0f), Projection);
    output.TexCoord = float2(1.0f, 0.0f);
    triangleStream.Append(output);

    triangleStream.RestartStrip();
}

Problem & Hardware

  • Maximize FPS for 1kkk particles that bounce in a 50×50 box.
  • We are going to run our application in an AMD Phenom II X6 1090T, 8GB Ram, GTX 680.

Solutions

#1 Simulation+Rendering in the same thread

In this case, we run the simulation and rendering steps sequentially as you can see in the following picture.

single_thread

The code is the following:

// Single Threaded Simulation
for (size_t i = 0; i < NUM_ENTITIES; ++i) {     mPositions[i].x += mVelocities[i].x * mElapsedTime;     mPositions[i].y += mVelocities[i].y * mElapsedTime;     mPositions[i].z += mVelocities[i].z * mElapsedTime;     if (-POSITION_OFFSET > mPositions[i].x || mPositions[i].x > POSITION_OFFSET) {
        mVelocities[i].x *= -1.0f;
    }
    if (-POSITION_OFFSET > mPositions[i].y || mPositions[i].y > POSITION_OFFSET) {
        mVelocities[i].y *= -1.0f;
    }
    if (-POSITION_OFFSET > mPositions[i].z || mPositions[i].z > POSITION_OFFSET) {
        mVelocities[i].z *= -1.0f;
    }
}
draw();

and its performance is the following:

1a

As you can see in the previous source code, there are 3 branches in updatePosition(). I changed branches per ternary operators in the following way:

// Single Threaded Simulation
for (size_t i = 0; i < NUM_ENTITIES; ++i) {
    mPositions[i].x += mVelocities[i].x * mElapsedTime;
    mPositions[i].y += mVelocities[i].y * mElapsedTime;
    mPositions[i].z += mVelocities[i].z * mElapsedTime;
    mVelocities[i].x *= (-POSITION_OFFSET < mPositions[i].x && mPositions[i].x < POSITION_OFFSET) ? 1.0f : -1.0f;
    mVelocities[i].y *= (-POSITION_OFFSET < mPositions[i].y && mPositions[i].y < POSITION_OFFSET) ? 1.0f : -1.0f;
    mVelocities[i].z *= (-POSITION_OFFSET < mPositions[i].z && mPositions[i].z < POSITION_OFFSET) ? 1.0f : -1.0f;
}
draw();

and this was the result:

1b.png

With some machines, the first will run faster (highly pipelined processors: no branching, optimized ternary operator). Other machines, will run quicker with the second form (simpler).

#2 Multithreaded Simulation + Rendering in the same thread

The previous solution runs all the simulation in a single thread, and it can be computed in a parallel way because particles move independently of other particles. For this purpose, we are going to run the simulation in parallel, and after it finishes, we will begin the rendering phase. You can see this scheme in the following picture:

g.png

The code is the following:

// Multithreaded Simulation
tbb::parallel_for(tbb::blocked_range<size_t>(0, NUM_ENTITIES),
[=](const tbb::blocked_range<size_t>& r) {
    for (size_t i = r.begin(); i < r.end(); ++i) {         mPositions[i].x += mVelocities[i].x * mElapsedTime;         mPositions[i].y += mVelocities[i].y * mElapsedTime;         mPositions[i].z += mVelocities[i].z * mElapsedTime;         if (-POSITION_OFFSET > mPositions[i].x || mPositions[i].x > POSITION_OFFSET) {
            mVelocities[i].x *= -1.0f;
        }
        if (-POSITION_OFFSET > mPositions[i].y || mPositions[i].y > POSITION_OFFSET) {
            mVelocities[i].y *= -1.0f;
        }
        if (-POSITION_OFFSET > mPositions[i].z || mPositions[i].z > POSITION_OFFSET) {
            mVelocities[i].z *= -1.0f;
        }
    }
});
draw();

and its performance is the following:

2.png

#3  Simulation and Rendering in independent threads

The previous solution needed to wait until parallel execution of simulation step finished before beginning rendering step. Simulation and rendering could be independent. We can use 2 buffers: one buffer is consumed by rendering thread while in the other, simulation thread is storing new results. Once simulation finishes, we can switch buffers (pointers) and rendering thread will consume fresh information while in the simulation thread we compute simulation results again. You can see this scheme in the following picture:

h.png

The code is the following:

// Simulation thread
void simulation(XMFLOAT3* *currentPosBuffer, XMFLOAT3* pos1, XMFLOAT3* pos2, XMFLOAT3* vel, const float* elapsedTime) {
    ASSERT(currentPosBuffer);
    ASSERT(pos1);
    ASSERT(pos2);
    ASSERT(vel);
    ASSERT(elapsedTime);

    XMFLOAT3* primaryPosBuffer = pos2;
    while (*currentPosBuffer) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, NUM_ENTITIES),
            [=](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i < r.end(); ++i) {                     primaryPosBuffer[i].x = (*currentPosBuffer)[i].x + vel[i].x * *elapsedTime;                     primaryPosBuffer[i].y = (*currentPosBuffer)[i].y + vel[i].y * *elapsedTime;                     primaryPosBuffer[i].z = (*currentPosBuffer)[i].z + vel[i].z * *elapsedTime;                     if (-POSITION_OFFSET > primaryPosBuffer[i].x || primaryPosBuffer[i].x > POSITION_OFFSET) {
                        vel[i].x *= -1.0f;
                    }
                    if (-POSITION_OFFSET > primaryPosBuffer[i].y || primaryPosBuffer[i].y > POSITION_OFFSET) {
                        vel[i].y *= -1.0f;
                    }
                    if (-POSITION_OFFSET > primaryPosBuffer[i].z || primaryPosBuffer[i].z > POSITION_OFFSET) {
                        vel[i].z *= -1.0f;
                    }
                }
            }
        );
        XMFLOAT3* tmp = primaryPosBuffer;
        primaryPosBuffer = *currentPosBuffer;
        *currentPosBuffer = tmp;
    }
}
and its performance is the following:
3

In conclusion, this was a problem that apparently was sequential, because we wanted to render results from a simulation, and it is evident that we need simulation results first, but as we saw, we can run simulation and rendering separately, in different threads, in an asynchronous way, making them independent:

c

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s