Recently, I felt I needed to improve my knowledge of the perspective projection matrix and the depth buffer, so I thought it would be a good idea to write some posts explaining the doubts I had and the sources I read to clear them up. This could be useful for others and a good reference for me.
I am a DirectX user, so everything I describe here relates to that API. OpenGL and Vulkan are similar, but I am not going to cover them.
The main motivation for revisiting my knowledge of the perspective projection matrix and the depth buffer was that I could not get SSAO working in my little DirectX 12 rendering engine. I tried and tried, and I got some ideas from great tutorials by Frank Luna and John Chapman, but unfortunately my implementation did not work properly. My friend Pablo Zurita suggested that I think about the problem without relying on any tutorials or related material. Instead, he encouraged me to solve the problem with my own solution, whether or not it is optimal. I decided to follow his advice, and that is why I wanted to improve and fix my knowledge of the perspective projection matrix and the depth buffer before continuing with SSAO.
“How can I find the pixel space coordinates of a 3D point?”
First of all, we have a point Po = (Xo, Yo, Zo, 1.0) in Object / Local / Model space coordinates. We transform the point from its original space to World Space using a matrix called the World Space matrix (W, for short). This is a transformation from (R, R, R, 1.0) to (R, R, R, 1.0), where R is the set of real numbers.
Pw = Po * W = (Xw, Yw, Zw, 1.0)
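To make the row-vector convention (Po * W) concrete, here is a minimal Python sketch. The world matrix below is a made-up pure translation by (10, 0, 5), and mul_vec_mat is just a helper name I chose for this illustration; neither comes from a real engine.

```python
def mul_vec_mat(v, m):
    """Multiply a 4-component row vector by a 4x4 matrix (list of rows)."""
    return tuple(sum(v[k] * m[k][c] for k in range(4)) for c in range(4))

# Hypothetical world matrix: a pure translation by (10, 0, 5).
# With row vectors (P * M), the translation lives in the last row.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [10.0, 0.0, 5.0, 1.0],
]

Po = (1.0, 2.0, 3.0, 1.0)  # point in object space, w = 1.0
Pw = mul_vec_mat(Po, W)    # same point in world space
print(Pw)                  # (11.0, 2.0, 8.0, 1.0)
```

Note that w stays at 1.0, which matches the (R, R, R, 1.0) to (R, R, R, 1.0) description above.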
Second, we need to transform the point Pw to the coordinate system of the camera. Again, we use a matrix, called the Eye / View / Camera Space matrix (V). This is a transformation from (R, R, R, 1.0) to (R, R, R, 1.0).
Pv = Pw * V = (Xv, Yv, Zv, 1.0)
We use a left-handed coordinate system, so the View Space Coordinate System has the following properties:
- Camera is at (0.0, 0.0, 0.0, 1.0)
- Forward vector is +z = (0.0, 0.0, 1.0, 0.0)
- Up vector is +y = (0.0, 1.0, 0.0, 0.0)
- Side vector is +x = (1.0, 0.0, 0.0, 0.0)
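As a rough sketch of how a view matrix with these properties can be built, the following Python code uses the usual left-handed look-at construction (the same row layout that the D3DXMatrixLookAtLH documentation describes). The camera position and target are invented values, and look_at_lh and the small vector helpers are hypothetical names used only for this example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)

def look_at_lh(eye, target, up):
    """Left-handed view matrix for row vectors (P * V)."""
    z = normalize(sub(target, eye))  # forward, becomes +z in view space
    x = normalize(cross(up, z))      # side, becomes +x
    y = cross(z, x)                  # up, becomes +y
    return [
        [x[0], y[0], z[0], 0.0],
        [x[1], y[1], z[1], 0.0],
        [x[2], y[2], z[2], 0.0],
        [-dot(x, eye), -dot(y, eye), -dot(z, eye), 1.0],
    ]

def mul_vec_mat(v, m):
    return tuple(sum(v[k] * m[k][c] for k in range(4)) for c in range(4))

eye = (3.0, 0.0, -4.0)      # invented camera position in world space
target = (0.0, 0.0, 0.0)    # the camera looks at the world origin
V = look_at_lh(eye, target, (0.0, 1.0, 0.0))

eye_in_view = mul_vec_mat(eye + (1.0,), V)        # approximately (0, 0, 0, 1)
target_in_view = mul_vec_mat(target + (1.0,), V)  # approximately (0, 0, 5, 1)
```

The camera ends up at the origin of view space, and the point it looks at lies on the +z axis at a distance of 5 units, which is the distance between eye and target.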
Next, we need to transform the point Pv to Homogeneous Clip Space coordinates. This is accomplished by multiplying the point by the Perspective Projection matrix (P). This is a transformation from (R, R, R, 1.0) to (R, R, R, R).
Ph = Pv * P = (Xh, Yh, Zh, Wh)
After this transformation, viewport clipping and culling are performed (we need to check whether the point is inside the viewing frustum). As described in a post by Fabian Giesen, the point Ph will not be clipped if the following conditions hold:
- -Wh <= Xh <= Wh
- -Wh <= Yh <= Wh
- 0.0 <= Zh <= Wh
- 0.0 < Wh
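The four conditions above translate almost directly into code. This is only a sketch of the test for a single point (real hardware clips whole primitives), and is_inside_clip_volume is a name I made up for the example.

```python
def is_inside_clip_volume(ph):
    """Return True if a clip-space point passes the DirectX-style clip test."""
    xh, yh, zh, wh = ph
    return (0.0 < wh
            and -wh <= xh <= wh
            and -wh <= yh <= wh
            and 0.0 <= zh <= wh)

print(is_inside_clip_volume((0.5, -0.5, 1.0, 2.0)))  # True
print(is_inside_clip_volume((3.0, 0.0, 1.0, 2.0)))   # False: Xh > Wh
print(is_inside_clip_volume((0.0, 0.0, -0.1, 2.0)))  # False: Zh < 0.0
print(is_inside_clip_volume((0.0, 0.0, 0.0, 0.0)))   # False: Wh is not > 0.0
```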
Next, we need to transform the point Ph to Normalized Device Coordinates space, or NDC for short. This is accomplished by dividing each coordinate of Ph by Wh, which is frequently called the perspective division. This is a transformation from (R, R, R, R) to [-1.0, 1.0] x [-1.0, 1.0] x [0.0, 1.0] x 1.0.
Pndc = Ph / Wh = (Xh / Wh, Yh / Wh, Zh / Wh, Wh / Wh)
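A small Python sketch of the perspective division, with an invented clip-space point; perspective_divide is just an illustrative name.

```python
def perspective_divide(ph):
    """Divide a clip-space point by its w to get NDC coordinates."""
    xh, yh, zh, wh = ph
    if wh == 0.0:
        # Such points should already have been rejected by clipping.
        raise ValueError("Wh must not be 0.0")
    return (xh / wh, yh / wh, zh / wh, wh / wh)

Ph = (1.0, -0.5, 1.5, 2.0)      # invented point that passes the clip test
Pndc = perspective_divide(Ph)
print(Pndc)                      # (0.5, -0.25, 0.75, 1.0)
```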
No matter what dimensions our view window has, the perspective projection and the subsequent perspective division always map the points that survive clipping from (R, R, R, R) to [-1.0, 1.0] x [-1.0, 1.0] x [0.0, 1.0] x 1.0. This “magic” is encapsulated in the perspective projection matrix generation, which will be explained in another post.
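To see this independence from the window dimensions in numbers, here is a hedged sketch. It assumes the standard left-handed perspective matrix layout (the one the D3DXMatrixPerspectiveFovLH documentation describes), whose derivation is left for the other post; the field of view, near and far distances, and the helper names are all invented for the example.

```python
import math

def mul_vec_mat(v, m):
    return tuple(sum(v[k] * m[k][c] for k in range(4)) for c in range(4))

def perspective_fov_lh(fov_y, aspect, zn, zf):
    """Assumed left-handed perspective matrix for row vectors, z in [0, 1]."""
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    q = zf / (zf - zn)
    return [
        [x_scale, 0.0, 0.0, 0.0],
        [0.0, y_scale, 0.0, 0.0],
        [0.0, 0.0, q, 1.0],
        [0.0, 0.0, -zn * q, 0.0],
    ]

def view_to_ndc(pv, p):
    xh, yh, zh, wh = mul_vec_mat(pv, p)
    return (xh / wh, yh / wh, zh / wh, wh / wh)

fov_y, zn, zf = math.radians(90.0), 1.0, 100.0
t = math.tan(fov_y / 2.0)

results = {}
for aspect in (4.0 / 3.0, 16.0 / 9.0):
    P = perspective_fov_lh(fov_y, aspect, zn, zf)
    # Top-right corners of the near and far planes of the frustum (view space).
    near_corner = (zn * t * aspect, zn * t, zn, 1.0)
    far_corner = (zf * t * aspect, zf * t, zf, 1.0)
    results[aspect] = (view_to_ndc(near_corner, P), view_to_ndc(far_corner, P))
    print(aspect, results[aspect])
```

For both aspect ratios the near corner lands at approximately (1, 1, 0, 1) and the far corner at approximately (1, 1, 1, 1), up to floating-point rounding.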
Why do we need to perform the View Space transformation before the Perspective Projection transformation? Why can we not project directly from World Space?
You can answer this question by reading the “Model Space, World Space, View Space” section of the CodingLabs article or the “Of the Importance of Converting Points to Camera Space” section of the ScratchAPixel article. In summary, the math simplifies a lot if the camera is centered at the origin and looking down one of the three axes.
Why is Wndc always 1.0?
Wh / Wh is 1.0 as long as Wh != 0.0. We cannot have Wh = 0.0 at this point because such points are discarded by the viewport clipping and culling step, which requires 0.0 < Wh.
Why is clipping performed on the result of the previous transformation (Ph) rather than here, after the perspective division?
After some research, I found the following GameDev.StackExchange post that answers this question:
Why does DirectX’s Zndc go from 0.0 to 1.0 instead of from -1.0 to 1.0 as in OpenGL?
That was another question I had. I thought the OpenGL approach was more intuitive and more “symmetric,” but after some research I found an excellent Outerra article that explains why the DirectX approach is better in its “DirectX vs OpenGL” section. I am not going to quote its explanation here because it is long and you need to read part of the article first, so please go and read that article.
What do clip space and NDC space look like? Are they cubic?
It depends on the API you are using. In DirectX the Z coordinate in NDC space ranges from 0.0 to 1.0, while it ranges from -1.0 to 1.0 in OpenGL. So NDC space is a cube in OpenGL (2 x 2 x 2), but in DirectX it is a box of size 2 x 2 x 1.
The following picture shows the transformation from View Space to NDC Space (perspective projection transformation + perspective division) in DirectX.
Finally, we need to transform the point Pndc to Pixel / Viewport Coordinates space. You can see this new space and its relationship with NDC space in the following picture extracted from MSDN documentation.
Note that in DirectX 10 and later, pixel centers are offset by (0.5, 0.5) from integer locations. In Direct3D 9 the situation is different: the integer locations are at the center of the pixel. OpenGL, by default, also places pixel centers at half-integer locations. Check the following picture:
Xp = (Xndc + 1.0) * ScreenWidth * 0.5 + ScreenTopLeftX
Yp = (1.0 - Yndc) * ScreenHeight * 0.5 + ScreenTopLeftY
Zp = MinDepth + Zndc * (MaxDepth - MinDepth)
Typically, if you use the whole screen, the top-left x and y will be zero. This is a transformation from (R, R, R, 1.0) to (Z+, Z+, R), where Z+ is the set of non-negative integers. Note that we can discard the w coordinate because it is no longer used, and that we use non-negative integers for the x and y coordinates because they are integer pixel locations: the formulas above produce real values, and the pixel that contains the point is found by taking their integer part.
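The three viewport formulas can be sketched in Python as follows. The parameter names loosely mirror the fields of D3D12_VIEWPORT (TopLeftX, TopLeftY, Width, Height, MinDepth, MaxDepth); the function name and the 1920 x 1080 viewport are assumptions made for the example.

```python
import math

def ndc_to_viewport(p_ndc, width, height,
                    top_left_x=0.0, top_left_y=0.0,
                    min_depth=0.0, max_depth=1.0):
    """Map an NDC point to viewport coordinates (y grows downwards)."""
    x_ndc, y_ndc, z_ndc = p_ndc[:3]  # w is no longer needed
    xp = (x_ndc + 1.0) * width * 0.5 + top_left_x
    yp = (1.0 - y_ndc) * height * 0.5 + top_left_y
    zp = min_depth + z_ndc * (max_depth - min_depth)
    return (xp, yp, zp)

center = ndc_to_viewport((0.0, 0.0, 0.5, 1.0), 1920.0, 1080.0)
top_left = ndc_to_viewport((-1.0, 1.0, 0.0, 1.0), 1920.0, 1080.0)
print(center)    # (960.0, 540.0, 0.5)
print(top_left)  # (0.0, 0.0, 0.0)

# Integer location of the pixel that contains the point.
pixel = (math.floor(center[0]), math.floor(center[1]))
print(pixel)     # (960, 540)
```

Note how Yndc = 1.0 (the top of NDC space) maps to Yp = 0.0, because the y axis is flipped in viewport coordinates.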
Ok, I understood the big picture of all this, but why is a 3D point represented as (x, y, z, 1.0)?
A point in 3D can be represented by (x, y, z). If we have a point in n-dimensional space, we can represent the same point in (n+1)-dimensional space by scaling the original coordinates by a single nonzero value k and then appending k as the final coordinate:
(X0, …, Xn-1) -> (kX0, …, kXn-1, k)
This new space is called real projective space or homogeneous space.
A 2D point is represented by (x, y). Let’s examine the point (x, y, 1) = (1.0, 0.8, 1.0)
For all points that are not in the plane w = 1, we can compute the corresponding 2D point by projecting the point onto the plane w = 1, that is, by dividing by w. So the homogeneous coordinate (x, y, w) is mapped to the point (x / w, y / w). There are an infinite number of corresponding points in homogeneous space (kx, ky, k) with k != 0.0. In particular, k = 2.5 gives (x, y, w) = (2.5, 2.0, 2.5), as you can see in the previous image.
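The mapping in both directions can be sketched in a few lines of Python, using the same numbers as above. The function names to_homogeneous and from_homogeneous are hypothetical, chosen just for this illustration.

```python
def to_homogeneous(point, k=1.0):
    """Lift an n-D point to (n+1)-D: scale by k and append k."""
    return tuple(k * c for c in point) + (k,)

def from_homogeneous(p):
    """Project a homogeneous point back onto the plane w = 1."""
    *coords, w = p
    if w == 0.0:
        raise ValueError("w = 0.0 has no corresponding point")
    return tuple(c / w for c in coords)

p = (1.0, 0.8)
lifted = to_homogeneous(p, 2.5)   # approximately (2.5, 2.0, 2.5)
back = from_homogeneous(lifted)   # approximately (1.0, 0.8) again
print(lifted, back)
```

Any nonzero k gives a different homogeneous point, but from_homogeneous brings all of them back to the same 2D point.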
When w is 0.0, there is no corresponding point in 2D space (the division is undefined). We can interpret this situation as a “point at infinity.” This is considered a direction (vector) rather than a location (point). The difference between two points A and B having a w coordinate of 1.0 results in a direction vector B - A having a w coordinate of 0.0. This makes sense because B - A represents the direction pointing from A to B (not affected by translation).
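A quick sketch of why w = 0.0 behaves like a direction: with an invented translation matrix (row-vector convention, translation in the last row), points move but the difference B - A does not. The helper names are mine.

```python
def mul_vec_mat(v, m):
    return tuple(sum(v[k] * m[k][c] for k in range(4)) for c in range(4))

# Hypothetical translation by (5, 0, 0).
T = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [5.0, 0.0, 0.0, 1.0],
]

A = (1.0, 2.0, 3.0, 1.0)  # point, w = 1.0
B = (4.0, 6.0, 3.0, 1.0)  # point, w = 1.0
direction = tuple(b - a for a, b in zip(A, B))  # (3.0, 4.0, 0.0, 0.0), w = 0.0

moved_A = mul_vec_mat(A, T)                  # (6.0, 2.0, 3.0, 1.0)
moved_direction = mul_vec_mat(direction, T)  # (3.0, 4.0, 0.0, 0.0), unchanged
print(moved_A, moved_direction)
```

Because w = 0.0 multiplies the translation row, the translation simply drops out for directions.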
The same concept applies to 3D points:
(x, y, z) -> (x, y, z, w)
Why do we need to represent 3D points with 4 coordinates?
After explaining the meaning of (x, y, z, 1.0), we need to ask about its usage. There are 2 main reasons:
- Translation cannot be done with a 3x3 matrix because (0.0, 0.0, 0.0) * M3x3 = (0.0, 0.0, 0.0) for any 3x3 matrix (so the origin cannot be translated). We are not going to explain translation matrix generation in this post, but it looks like the following image
- Setting a proper value for w lets us perform perspective projection (by the perspective division)
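Both reasons can be checked with a small Python sketch. All matrices below are invented for illustration: M3 is an arbitrary 3x3 matrix, translation builds a 4x4 translation in the row-vector convention, and COPY_Z_TO_W is a toy matrix (not a full projection matrix) that only copies z into w.

```python
def mul_vec_mat(v, m):
    n = len(v)
    return tuple(sum(v[k] * m[k][c] for k in range(n)) for c in range(n))

# Reason 1: a 3x3 matrix can never move the origin...
M3 = [[2.0, 0.0, 1.0],
      [0.0, 3.0, 0.0],
      [4.0, 0.0, 5.0]]
origin3 = mul_vec_mat((0.0, 0.0, 0.0), M3)  # (0.0, 0.0, 0.0)

# ...but a 4x4 matrix can, thanks to w = 1.0.
def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [tx, ty, tz, 1.0]]

origin4 = mul_vec_mat((0.0, 0.0, 0.0, 1.0), translation(7.0, -2.0, 3.0))
# origin4 is (7.0, -2.0, 3.0, 1.0)

# Reason 2: putting z into w makes the later division shrink distant points.
COPY_Z_TO_W = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0],
               [0.0, 0.0, 0.0, 0.0]]
xh, yh, zh, wh = mul_vec_mat((2.0, 1.0, 4.0, 1.0), COPY_Z_TO_W)
projected = (xh / wh, yh / wh)  # (0.5, 0.25): x and y divided by z = 4.0
print(origin3, origin4, projected)
```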
An excellent resource for understanding homogeneous coordinates and their geometric interpretation in depth is Chapter 18 of A Trip Down the Graphics Pipeline by Jim Blinn.
In addition to all the links cited in the article, I used the following references:
In the next post, I am going to write about how to compute the perspective projection matrix.