Instancing vs Geometry Shader vs Vertex Shader – Round 2

This time I want to do the same as in my previous post, but with spheres instead of boxes.

The sphere is more complicated than a box because:

  • It has more vertices
  • It is more complex to generate (trigonometric functions are not cheap)

Sphere Generation

We use the following algorithm to generate spheres on the CPU:

void CreateSphere(const XMFLOAT3 centerPos, const float radius, const unsigned int sliceCount, const unsigned int stackCount, std::vector<XMFLOAT3>& positions, std::vector<unsigned int>& indices) {
   ASSERT(radius > 0.0f);
   ASSERT(sliceCount > 0);
   ASSERT(stackCount > 1);

   const unsigned int indexOffset = static_cast<unsigned int> (positions.size());

   positions.reserve(positions.size() + (stackCount - 1) * (sliceCount + 1) + 2);
   indices.reserve(indices.size() + 6 * sliceCount + 6 * (sliceCount + 1) * (stackCount - 2));

   //
   // Compute the vertices starting at the top pole and moving down the stacks.
   //

   // Poles: note that there will be texture coordinate distortion as there is
   // not a unique point on the texture map to assign to the pole when mapping
   // a rectangular texture onto a sphere.
   const XMFLOAT3 topVertex(centerPos.x, centerPos.y + radius, centerPos.z);
   const XMFLOAT3 bottomVertex(centerPos.x, centerPos.y - radius, centerPos.z);

   positions.push_back(topVertex);

   const float phiStep = XM_PI / stackCount;
   const float thetaStep = 2.0f * XM_PI / sliceCount;

   // Compute vertices for each stack ring (do not count the poles as rings).
   for (unsigned int i = 1; i < stackCount; ++i) {
       const float phi = i * phiStep;
       const float sinPhi = sinf(phi);
       const float cosPhi = cosf(phi);

        // Vertices of ring (the first and last vertex of a ring duplicate the seam).
        for (unsigned int j = 0; j <= sliceCount; ++j) {
           const float theta = j * thetaStep;

           XMFLOAT3 v = centerPos;

           // spherical to cartesian
           v.x += radius * sinPhi * cosf(theta);
           v.y += radius * cosPhi;
           v.z += radius * sinPhi * sinf(theta);

           positions.push_back(v);
       }
   }

   positions.push_back(bottomVertex);

   //
   // Compute indices for top stack. The top stack was written first to the vertex buffer
   // and connects the top pole to the first ring.
   //

    for (unsigned int i = 1; i <= sliceCount; ++i) {
       indices.push_back(indexOffset + 0);
       indices.push_back(indexOffset + i + 1);
       indices.push_back(indexOffset + i);
   }

   //
   // Compute indices for inner stacks (not connected to poles).
   //

   // Offset the indices to the index of the first vertex in the first ring.
   // This is just skipping the top pole vertex.
   unsigned int baseIndex = indexOffset + 1;
   const unsigned int ringVertexCount = sliceCount + 1;
   for (unsigned int i = 0; i < stackCount - 2; ++i) {
       for (unsigned int j = 0; j <= sliceCount; ++j) {
            indices.push_back(baseIndex + i * ringVertexCount + j);
            indices.push_back(baseIndex + i * ringVertexCount + j + 1);
            indices.push_back(baseIndex + (i + 1) * ringVertexCount + j);

            indices.push_back(baseIndex + (i + 1) * ringVertexCount + j);
            indices.push_back(baseIndex + i * ringVertexCount + j + 1);
            indices.push_back(baseIndex + (i + 1) * ringVertexCount + j + 1);
       }
   }

   //
   // Compute indices for bottom stack. The bottom stack was written last to the vertex buffer
   // and connects the bottom pole to the bottom ring.
   //

   // South pole vertex was added last.
    const unsigned int southPoleIndex = static_cast<unsigned int>(positions.size()) - 1;

   // Offset the indices to the index of the first vertex in the last ring.
   baseIndex = southPoleIndex - ringVertexCount;

   for (unsigned int i = 0; i < sliceCount; ++i) {
        indices.push_back(southPoleIndex);
        indices.push_back(baseIndex + i);
        indices.push_back(baseIndex + i + 1);
   }
 }
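
For example, a single sphere with the configuration used in this post (6 slices, 7 stacks) could be generated like this (a usage sketch; the origin center, the 4.0f radius, and the ASSERT checks are illustrative, not taken from the original sample):

std::vector<XMFLOAT3> positions;
std::vector<unsigned int> indices;

// 6 slices, 7 stacks, radius 4.0f, centered at the origin.
CreateSphere(XMFLOAT3(0.0f, 0.0f, 0.0f), 4.0f, 6U, 7U, positions, indices);

// (stackCount - 1) * (sliceCount + 1) + 2 = 6 * 7 + 2 = 44 vertices
ASSERT(positions.size() == 44U);

// 6 * sliceCount + 6 * (sliceCount + 1) * (stackCount - 2) = 36 + 210 = 246 indices
ASSERT(indices.size() == 246U);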

Test Scenario

  • Video Card: Nvidia GTX 680
  • Graphics API: DirectX 11
  • No Multisampling
  • Back buffer resolution: 1920 x 1080
  • Geometry: A sphere. We generate each sphere center position randomly. We use a uniform distribution function.

It is important to mention that we chose 7 stacks and 6 slices, so each sphere has 44 vertices and 246 indices when generated on the CPU. In the geometry shader we generate a TriangleStream, which only works with vertices that form triangles, not with indices. A geometry shader can output at most 1024 scalars (DirectX 11 – Shader Model 5.0 – GTX 680). Each output vertex has 4 scalars (x, y, z, w), and 246 vertices * 4 scalars = 984 scalars, which is why we cannot choose more stacks or slices.
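
As a quick sanity check, here is a small compile-time sketch of the geometry shader output budget for this configuration (not part of the original sample; the constant names are illustrative):

constexpr unsigned int kSliceCount = 6U;
constexpr unsigned int kStackCount = 7U;

// Vertices emitted by the geometry shader per sphere:
// top fan + bottom fan (3 * kSliceCount each) plus the inner stack quads.
constexpr unsigned int kGSVerticesPerSphere =
    6U * kSliceCount + (kStackCount - 2U) * (kSliceCount + 1U) * 6U; // = 246

// Each emitted vertex carries a single float4 (SV_POSITION) = 4 scalars.
constexpr unsigned int kGSOutputScalars = kGSVerticesPerSphere * 4U; // = 984

// Shader Model 5.0 limits geometry shader output to 1024 scalars per invocation.
static_assert(kGSOutputScalars <= 1024U, "GS output exceeds the 1024-scalar limit");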

In the following picture, you can see 1000 spheres that are uniformly distributed.

[Image: 1000 uniformly distributed spheres]

Instancing Technique

Input Layout

The position is the position of each vertex, and the direction is a per-instance vector used to translate the vertices of the current instance.

D3D11_INPUT_ELEMENT_DESC inputElementDescriptions[] = {
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "DIRECTION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 1, 0, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
};

Buffers

  • Vertex buffer: 44 vertices representing sphere vertex positions
  • Index buffer: 246 indices that build 82 triangles
  • Instancing buffer: NUM_SPHERES direction vectors (float3) used to translate each sphere's vertices (binding and the instanced draw call are sketched below).
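
For reference, here is a minimal sketch of how the two vertex streams could be bound and drawn with this layout (context, vertexBuffer, instanceBuffer, indexBuffer, inputLayout, and NUM_SPHERES are assumed names, not taken from the original sample):

// Assumed to exist: context (ID3D11DeviceContext*), vertexBuffer, instanceBuffer,
// indexBuffer (ID3D11Buffer*), inputLayout (ID3D11InputLayout*), NUM_SPHERES.
ID3D11Buffer* vertexBuffers[] = { vertexBuffer, instanceBuffer };
const UINT strides[] = { sizeof(XMFLOAT3), sizeof(XMFLOAT3) }; // per-vertex position, per-instance direction
const UINT offsets[] = { 0U, 0U };

context->IASetInputLayout(inputLayout);
context->IASetVertexBuffers(0U, 2U, vertexBuffers, strides, offsets);
context->IASetIndexBuffer(indexBuffer, DXGI_FORMAT_R32_UINT, 0U);
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

// 246 indices per sphere, NUM_SPHERES instances, one draw call.
context->DrawIndexedInstanced(246U, NUM_SPHERES, 0U, 0U, 0U);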

Shaders

Vertex shader: Translates vertex position by instanced direction.

struct Input {
    float3 PosOS : POSITION;
    float3 DirOS : DIRECTION;
    uint InstanceId: SV_InstanceID;
};
struct Output {
   float4 PosH : SV_POSITION;
   float3 Color : COLOR;

};
cbuffer CBufferPerFrame : register (b0) {
    float4x4 WorldViewProjection;
}
Output main(const Input input, const uint vertexId : SV_VertexId) {
    Output output = (Output)0;
    output.PosH = mul(float4(input.PosOS + input.DirOS, 1.0f), WorldViewProjection);
    const float colorComp = (vertexId % 3) * 0.5f;
    output.Color = float3(colorComp, 0.0f, colorComp);
    return output;
}

Pixel shader:

struct PSInput {
   float4 PosH : SV_POSITION;
   float3 Color : COLOR;
};

float4
main(in PSInput input) : SV_TARGET {
    return float4(input.Color, 1.0f);
}
Geometry Shader Technique

Input Layout

Sphere center position.

D3D11_INPUT_ELEMENT_DESC inputElementDescriptions[] = {
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

Buffers

  • Vertex buffer: NUM_SPHERES vertices representing sphere center positions (the draw call is sketched below).
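
A minimal sketch of how this technique could be driven (context, vertexBuffer, inputLayout, and NUM_SPHERES are assumed names, not taken from the original sample):

const UINT stride = sizeof(XMFLOAT3); // one float3 center position per sphere
const UINT offset = 0U;

context->IASetInputLayout(inputLayout);
context->IASetVertexBuffers(0U, 1U, &vertexBuffer, &stride, &offset);

// Each vertex is a single point; the geometry shader expands it into the sphere triangles.
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_POINTLIST);
context->Draw(NUM_SPHERES, 0U);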

Shaders

Vertex shader: Coordinate space transformations only

struct Input {
    float3 PosOS : POSITION;
};
struct Output {
    float4 PosWS : POSITION;
};
cbuffer CBufferPerFrame : register (b0) {
    float4x4 World;
}
Output main(const Input input) {
    Output output = (Output)0;
    output.PosWS = mul(float4(input.PosOS, 1.0f), World);
    return output;
}

Geometry shader: From each input vertex (sphere center position), it generates a triangle stream of SPHERE_SLICE_COUNT * 6 + (SPHERE_STACK_COUNT - 2) * (SPHERE_SLICE_COUNT + 1) * 6 = 246 vertices.

#define SPHERE_RADIUS 4.0f
#define SPHERE_SLICE_COUNT 6
#define SPHERE_STACK_COUNT 7
#define NUM_SPHERE_VERTICES (SPHERE_SLICE_COUNT * 6 + ((SPHERE_STACK_COUNT - 2) * (SPHERE_SLICE_COUNT + 1)) * 6)
#define PI 3.14159265359

struct GSInput {
    float4 PosWS : POSITION;
};

cbuffer cbPerFrame : register (b0) {
    float4x4 ViewProjection;
    float QuadHalfSize;
};

struct GSOutput {
    float4 PosH : SV_POSITION;
};

[maxvertexcount(NUM_SPHERE_VERTICES)]
void
main(const in point GSInput input[1], inout TriangleStream<GSOutput> triangleStream) {
    const uint positionsSize = 2 + (SPHERE_STACK_COUNT - 1) * (SPHERE_SLICE_COUNT + 1);
    float4 positions[positionsSize];

    //
    // Compute the vertices starting at the top pole and moving down the stacks.
    //

    float4 topVertex = float4(input[0].PosWS.x, input[0].PosWS.y + SPHERE_RADIUS, input[0].PosWS.z, 1.0f);
    float4 bottomVertex = float4(input[0].PosWS.x, input[0].PosWS.y - SPHERE_RADIUS, input[0].PosWS.z, 1.0f);

    positions[0] = mul(topVertex, ViewProjection);

    const float phiStep = PI / SPHERE_STACK_COUNT;
    const float thetaStep = 2.0f * PI / SPHERE_SLICE_COUNT;

    // Compute vertices for each stack ring (do not count the poles as rings).
    uint currentIndex = 1;
    for (uint i = 1; i < SPHERE_STACK_COUNT; ++i) {
        const float phi = i * phiStep;
        const float sinPhi = sin(phi);
        const float cosPhi = cos(phi);

        // Vertices of ring.
        for (uint j = 0; j <= SPHERE_SLICE_COUNT; ++j) {
            const float theta = j * thetaStep;

            float4 v = input[0].PosWS;

            // spherical to cartesian
            v.x += SPHERE_RADIUS * sinPhi * cos(theta);
            v.y += SPHERE_RADIUS * cosPhi;
            v.z += SPHERE_RADIUS * sinPhi * sin(theta);

            positions[currentIndex] = mul(v, ViewProjection);
            ++currentIndex;
        }
    }

    positions[currentIndex] = mul(bottomVertex, ViewProjection);

    // Generate sphere triangles
    GSOutput output;

    // Generate triangles for top stack.
    for (uint k = 1; k <= SPHERE_SLICE_COUNT; ++k) {
        output.PosH = positions[0];
        triangleStream.Append(output);

        output.PosH = positions[k + 1];
        triangleStream.Append(output);

        output.PosH = positions[k];
        triangleStream.Append(output);

        triangleStream.RestartStrip();
    }

    //
    // Generate triangles for inner stacks (not connected to poles).
    //

    // Offset the indices to the index of the first vertex in the first ring.
    // This is just skipping the top pole vertex.
    uint baseIndex = 1;
    const uint ringVertexCount = SPHERE_SLICE_COUNT + 1;
    for (uint n = 0; n < SPHERE_STACK_COUNT - 2; ++n) {
        for (uint m = 0; m <= SPHERE_SLICE_COUNT; ++m) {
            output.PosH = positions[baseIndex + n * ringVertexCount + m];
            triangleStream.Append(output);

            output.PosH = positions[baseIndex + n * ringVertexCount + m + 1];
            triangleStream.Append(output);

            output.PosH = positions[baseIndex + (n + 1) * ringVertexCount + m];
            triangleStream.Append(output);

            triangleStream.RestartStrip();

            output.PosH = positions[baseIndex + (n + 1) * ringVertexCount + m];
            triangleStream.Append(output);

            output.PosH = positions[baseIndex + n * ringVertexCount + m + 1];
            triangleStream.Append(output);

            output.PosH = positions[baseIndex + (n + 1) * ringVertexCount + m + 1];
            triangleStream.Append(output);

            triangleStream.RestartStrip();
        }
    }

    //
    // Generate triangles for bottom stack
    //

    // South pole vertex was added last.
    const uint southPoleIndex = positionsSize - 1;

    // Offset the indices to the index of the first vertex in the last ring.
    uint offset = southPoleIndex - ringVertexCount;
    for (uint l = 0; l < SPHERE_SLICE_COUNT; ++l) {
        output.PosH = positions[southPoleIndex];
        triangleStream.Append(output);

        output.PosH = positions[offset + l];
        triangleStream.Append(output);

        output.PosH = positions[offset + l + 1];
        triangleStream.Append(output);

        triangleStream.RestartStrip();
    }
}

Pixel shader: Returns color only

struct PSInput {
    float4 PosH : SV_POSITION;
};
float4
main(in PSInput input) : SV_TARGET {
    return float4(1.0f, 0.0f, 0.0f, 1.0f);
}

Vertex Shader Technique

Input Layout

Position of each vertex of each sphere.

D3D11_INPUT_ELEMENT_DESC inputElementDescriptions[] = {
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

Buffers

  • Vertex buffer: It contains 44 vertices (positions) per sphere.
  • Index buffer: It contains 246 indices per sphere, to build 82 triangles per sphere (buffer construction is sketched after this list).
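
A minimal sketch of how the CPU-side buffers could be filled for this technique, reusing CreateSphere from above (randomCenter() and NUM_SPHERES are assumed helpers/constants, not taken from the original sample):

std::vector<XMFLOAT3> positions;
std::vector<unsigned int> indices;
positions.reserve(NUM_SPHERES * 44U);
indices.reserve(NUM_SPHERES * 246U);

for (unsigned int i = 0U; i < NUM_SPHERES; ++i) {
    // CreateSphere appends to both vectors and offsets the new indices by the number
    // of vertices already stored, so spheres can be concatenated directly.
    CreateSphere(randomCenter(), 4.0f, 6U, 7U, positions, indices);
}

// The two vectors are uploaded once into a single vertex buffer and a single index
// buffer, and the whole scene is rendered with one call:
// context->DrawIndexed(NUM_SPHERES * 246U, 0U, 0U);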

Shaders

Vertex shader: Coordinate space transformations only

struct Input {
    float3 PosOS : POSITION;
};
struct Output {
    float4 PosH : SV_POSITION;
    float3 Color : COLOR;
};
cbuffer CBufferPerFrame : register (b0) {
    float4x4 WorldViewProjection;
}
Output main(const Input input, const uint vertexId : SV_VertexId) {
    Output output = (Output)0;
    output.PosH = mul(float4(input.PosOS, 1.0f), WorldViewProjection);
    const float colorComp = (vertexId % 3) * 0.5f;
    output.Color = float3(colorComp, 0.0f, colorComp);
    return output;
}

Pixel shader: Returns color only

struct PSInput {
    float4 PosH : SV_POSITION;
    float3 Color : COLOR;
};
float4
main(in PSInput input) : SV_TARGET {
    return float4(input.Color, 1.0f);
}

Benchmarks

I tested these three techniques with different numbers of spheres on the machine described above. It is important to mention that the geometry is static: all buffer generation was done at the beginning of the execution and was not taken into account in the FPS computation. These are the results:

[Benchmark screenshots: frame rates for 10k, 100k, and 1kk (1,000,000) spheres]

Conclusion

There is a clear winner: Instancing Technique.

If the number of spheres is N = 1kk (1,000,000), let us compute how many bytes each technique sends from the CPU to the GPU in each draw call (the arithmetic is worked out in the sketch after this list):

  • Instancing Technique:
    • Vertex Buffer = 44 * 3 * sizeof(float) = 528 bytes +
    • Index Buffer = 246 * sizeof(unsigned int) = 984 bytes +
    • Instance Buffer = N * 3 * sizeof(float) = 12*N bytes
    • Total = 12,001,512 bytes = ~11.45 MB
  • Geometry Shader Technique:
    • Vertex Buffer = N * 3 * sizeof(float) = 12 * N bytes
    • Total = 12,000,000 bytes = ~11.44 MB
  • Vertex Shader Technique:
    • Vertex Buffer = N * 44 * 3 * sizeof(float) = 528 * N bytes +
    • Index Buffer = N * 246 * sizeof(unsigned int) = 984 * N bytes
    • Total = 1,512,000,000 bytes = ~1441.96 MB
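
The arithmetic behind these totals, as a small standalone sketch (the MB figures use 1 MB = 1024 * 1024 bytes):

#include <cstdio>

int main() {
    const double N = 1000000.0; // 1kk spheres

    // Instancing: shared vertex/index buffers plus one float3 direction per instance.
    const double instancingBytes = 44.0 * 3.0 * 4.0 + 246.0 * 4.0 + N * 3.0 * 4.0;
    // Geometry shader: one float3 center per sphere.
    const double geometryShaderBytes = N * 3.0 * 4.0;
    // Vertex shader: full vertex and index buffers per sphere.
    const double vertexShaderBytes = N * (44.0 * 3.0 * 4.0 + 246.0 * 4.0);

    const double toMB = 1.0 / (1024.0 * 1024.0);
    std::printf("Instancing:      %.2f MB\n", instancingBytes * toMB);     // ~11.45 MB
    std::printf("Geometry shader: %.2f MB\n", geometryShaderBytes * toMB); // ~11.44 MB
    std::printf("Vertex shader:   %.2f MB\n", vertexShaderBytes * toMB);   // ~1441.96 MB
    return 0;
}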

There is a clear loser: the Vertex Shader technique. The Instancing and Geometry Shader techniques send a similar amount of data, while the Vertex Shader technique sends roughly 126x more than either of them.

Some points to take into account from Rounds 1 and 2:

  • Instancing performs better when your geometry has a large number of vertices; otherwise, performance will suffer.
  • The geometry shader performs better when your geometry does not have a large number of vertices and its generation does not need expensive operations like sin or cos. Also, a geometry shader invocation can only output a limited number of scalars.
  • The vertex shader technique needs to generate all the geometry on the CPU and send it from the CPU to the GPU in each frame, consuming a lot of bandwidth.