| |
|
|
||||
![]() |
||||||
| |
|
|||||
|
Example 1: A 16-bit floating-point Z-buffer
What is the drawback of Z-buffers? They’re costly, that’s true. But the cost of a Z-buffer depends on the one parameter associated with its algorithm: the depth. There are several choices when it comes to constructing an efficient Z-buffer, but it turns out none of them are perfect for our use. The highest precision type of Z-buffer is given by the 32-bit floating-point Z-buffer. However, this flavor of the algorithm is also the slowest and it can’t be used with MMX code – so we can disregard it. The 32-bit integer Z-buffer is faster, and remains satisfying in terms of precision: it looks pretty good. Finally, the 16-bit integer Z-buffer is not accurate enough, although it would be quite quick. So the best choice is the 32-bit integer Z-buffer. Do we have to be satisfied with such a situation? No. Game developers are greedy, and we want both 32-bit integer precision and 16-bit integer speed and compactness. It’s this desire that germinated the idea of a 16-bit floating point Z-Buffer. You may be saying to yourself, "You crazy Frenchmen, 16-bit floats don’t exist!" and you’re quite right: they don’t. But I’m are not crazy, either. In order to use 16-bit floating-point Z-buffers, you have to create them. Before diving deep into the description of small floating-point numbers, I want to clear something up. It’s easy to imagine that the code needed to implement the 16-bit floating-point (16bFP) Z-buffer is more complex than the code needed for the 32-bit integer (32bINT) Z-buffer, and you may be concerned that this added complexity will slow down your game. However, remember the lesson that was learned when we tried to structure the image data in memory: sometimes the efficiency gained from using less memory outweighs the penalty imposed by using more complex code. At the end of this example, I’ll explain the circumstances in which the 16bFP is faster than the 32bINT, but keep in mind that both have comparable results. With that in mind, let’s get started creating 16bFP! First, ask yourself the following: "What arithmetic operations are used in the Z-buffer algorithm?" We need just two operations: addition and comparison (see Figure 1).
The first approach is to write code for these operations for our new representation. I advise those of you that are not familiar with floating point number structure to look at Figure 2, or better yet, read Chris Hecker’s column, "Let’s get to the [floating] point" (Game Developer magazine, February 1996). Of the two operations, the comparison could be the simpler if a 16-bit floating-point number structure similar to the 32-bit one was used (exponent in the top part, and mantissa in the low part). This structure allows us to compare two floating point numbers as if they were integers, which ensures a fast and easy comparison. Now, if we can find an efficient addition scheme for two 16-bit floats, we’ll be set. A floating point number can be divided into 3 parts : the sign bit, the exponent and the mantissa.
Unfortunately, I have not found an efficient way to perform 16-bit floating point addition. Instead, I use a somewhat different approach. First I use the simplest addition operation that available: 32-bit integer addition. After this addition, the result must be converted into a 16-bit floating point number, which turns out to be rather simple. This conversion looks like a compression scheme: the idea is to work like floating point numbers by eliminating the most significant bits that are at zero. In other words, a 16-bit floating number is composed by a first integer number that indicates the position of the first (most significant bit) that is set (we call it the exponent) and a second integer number created by a truncated piece of the original 32-bit integer number (the mantissa). For example, assume the original 32-bit integer number is represented as: 00010101010101010011001001110000b The most significant bit that is at 1 is the 28th (11100b) in this example, but it could have any value from 31 down to 0. So, we need five bits to code the exponent. That leaves us 16-5=11 bits for the mantissa. For this example, then, the mantissa is (in bold): 000 10101010101 010011001001110000b The final 16 bit floating point number looks like this: 11100 10101010101b This organization preserves the magnitude-order of the numbers, so we’ve succeed in our goal to keep a simple comparison operation. Let’s sum up what we said by rewriting the algorithm (see Figure 3).
Let’s look at the conversion function now (see Figure 4). There are several ways to perform the conversion. I’ll describe the smallest one, which needs only three x86 instructions. Note than even though this is the smallest implementation, it is not necessarily the fastest: depending on the context, other solutions could be faster.
First we need to know which is the most significant bit that is set. To do determine this, you can use the x86 instruction called BSR (Bit Search Reverse). This instruction used to consume several dozen cycles on a Pentium processor, but it became a just two-cycle instruction on the Pentium II. Second, eliminate the top three most significant bits like this: From : EAX = 00010101010101010011001001110000b To : EAX = 10101010101010011001001110000000b To do that, the most intuitive solution is to shift EAX three bits to the left. However, to compute the number of bits to shift, we need to use at least two instructions and to make use of one more register (MOV EDX,31; SUB EDX,ECX) in addition to the shift instruction. Another possibility is to use the more exotic SHRD (Shift Right Double) instruction. This instruction shifts a register to the right, introducing the shifted bits into another register from its left hand side. (You’ll notice that the most significant bit set is lost, as real floating numbers.) For the third step, we just need to combine the exponent (ECX) and the mantissa (EDX). To do that we use the SHRD operation again. We’re almost done. Our 16-bit float is now present in the upper half of EDX register, but to use it, it must be in the low half (DX) of the register. For this reason we won’t use the last SHRD instruction with a count of 5, but with a count of 5+16=21. Let’s sum up the conversion function. The total resources used are three instructions and uses two registers: BSR ECX ,EAX SHRD EDX ,EAX,CL SHRD EDX ,ECX,21 You’ve may have noticed that the EDX register isn’t initialized. This is only necessary if the 32bINT number is less than 2^11 - 1 = 2047 (meaning that the most significant bit set is less than the mantissa size in bits). However, experience shows that we can usually do without this initialization. These three instructions are "long decoded" on Pentium II processors, which means that every instruction needs one cycle to be decoded, and generates two micro-operations each. In order to lower the cost of conversion, the instructions above should be interlaced with simple, decoded instructions. Before I move on, let me issue a caveat regarding the BSR instruction. This instruction is usable only on the Pentium Pro, Pentium II, Pentium III and Celeron processors. The cost of using the BSR instruction remains exorbitant on all other processors, including the K6-family. Now that we’ve looked at the 16-bit floating point Z-buffer, let’s think about its performance. I ran some tests on a 233 MHz Pentium II with EDO RAM, and found the performance of a 32-bit integer Z-buffer to be very similar to the performance of the 16-bit floating point version, so we can choose between them depending on the situation. The 32bINT Z-buffer should be used on systems where memory access is cheaper (such as those systems that have fast SDRAM), and the 16bFP is a better solution on Celeron-based PCs where memory access must be avoided at all costs. Even with the 128KB L2 cache of Celeron A processors, the 16bFP solution minimizes cache pollution, thereby achieving better performance. The moral of this story is fairly simple: Increasing code complexity to reduce data size is a legitimate way to achieve better performance.
|
|||||||||||||||||||||||||||||||||||||||||
|
|