It's free to join Gamasutra!|Have a question? Want to know who runs this site? Here you go.|Targeting the game development market with your product or service? Get info on advertising here.||For altering your contact information or changing email subscription preferences.
Registered members can log in here.Back to the home page.

Search articles, jobs, buyers guide, and more.

By Alexandre Macris and
Pascal Urro

Gamasutra
April 9, 1999

Letters to the Editor:
Write a letter
View all letters


Features

Example 2: Managing Big Triangles with
Large Texture Maps

Contents

Introduction

Example 1: A 16-bit floating-point Z-buffer

Example 2: Managing Big Triangles with Large Texture Maps

The second example, which I’ll present now, is more persuasive for a number of reasons. First, this problem is directly related to cache memory. Second, the solution can’t be found without a deep analysis of the way cache memory works. Third, the solution requires a minuscule change in the program.

In this example let’s look at a software rendering engine, to be used in a Quake-like game. In these kind of games, we find corridors, rooms, and open spaces. In big rooms or open spaces, we have huge roofs (or a sky), and huge floors. We’ll be talking about these huge surfaces.

Let’s assume that the roof in a particular room is a square divided in two triangles, and one 256x256 texel texture is used for the whole square. Here’s the problem: when a player looks at the roof, the game’s frame rate varies from a certain value to half that value, whenever the player rotates his head. To solve the problem, you don’t have to change one line of code. You simply have to add a pitch of sixteen texels to your mapped texture. This means that the 256x256 texel maps will be stored in a memory space of 272x256 texels.

The texture map is stored in memory as shown in Figure 5. Imagine that you are in the middle of the room, facing North, and looking at the roof (or the floor). The rendering engine maps the two triangles, reading the texture in the order that it’s stored, and everything is fine. A problem emerges, however, when you turn your head 90 degrees in either direction, because the texture is no longer being read in the same order that it’s organized in memory. To clarify, let’s take a closer look at what is happening in the left side of the texture map.

A0 B0 C0 D0 E0 F0 G0 H0 I0 J0 K0 L0 M0 N0 O0 P0 Q0 R0 S0 T0 U0 V0 W0...
A1 B1 C1 D1 E1 F1 G1 H1 I1 J1 K1 L1 M1 N1 O1 P1 Q1 R1 S1 T1 U1 V1 W1...
A2 B2 C2 D2 E2 F2 G2 H2 I2 J2 K2 L2 M2 N2 O2 P2 Q2 R2 S2 T2 U2 V2 W2...
A3 B3 C3 D3 E3 F3 G3 H3 I3 J3 K3 L3 M3 N3 O3 P3 Q3 R3 S3 T3 U3 V3 W3...
A4 B4 C4 D4 E4 F4 G4 H4 I4 J4 K4 L4 M4 N4 O4 P4 Q4 R4 S4 T4 U4 V4 W4...
A5 B5 C5 D5 E5 F5 G5 H5 I5 J5 K5 L5 M5 N5 O5 P5 Q5 R5 S5 T5 U5 V5 W5...
A6 B6 C6 D6 E6 F6 G6 H6 I6 J6 K6 L6 M6 N6 O6 P6 Q6 R6 S6 T6 U6 V6 W6...
A7 B7 C7 D7 E7 F7 G7 H7 I7 J7 K7 L7 M7 N7 O7 P7 Q7 R7 S7 T7 U7 V7 W7...
...
...
A255 B255 C255 D255 E255 F255 G255 H255 I255 J255 K255 L255 M255...

Assume that each letter-digit pair represents a 16
bit texel, and that the surface to map is exactly the size
of the texture.

Figure 5: How the texture map is stored

 

What happens when we look at the roof in the "good" way? Well, first the A0 texel is read and written to the corresponding pixel. The same treatment is given to the texels B0, C0, D0, and so on. Nothing strange here. What is happening from the point of view of the cache? For the cache, reading A0 will cause a cache line fill. This operation store texels A0 to P0 in the cache memory (on Pentium II, the cache line size is 32 bytes, but K6-family processors have 2x32 byte size lines). So a certain number of cycles will be consumed during the access to the A0 pixel, but accessing texels B0 through P0 does not consume any other processor cycles.

What happens when we look at the roof in the "bad" way? This time, it’s not the letter that will change, but the digit. Lines A0, A1, A2,... will be accessed successively, so the cache will, as expected, load lines A0 to P0, A1 to P1, A2 to P2, and so on. This induces costly penalties due to cache misses and cache loads.

As the first line (A0 to A255) is processed, a huge number of cycles are lost. That wouldn’t be a problem if the texels B0 to B255, C0 to C255, D0 to D255, ..., P0 to P255 were present in the data cache afterward. Will it be? It’s tempting to think so, since the entire structure will fit in cache (32 bytes per cache line * 256 texels in the texture map * 2 bytes per texel = 16KB). You might think that the entire A0,P0,A255,P255 block could be cached after reading texels A0 to A255. Unfortunately, this isn’t the case. The unhappy consequence of accessing A0 to A255 successively is that no texel is cached when accessed, which is why there is a considerable decrease in the game’s frame rate. So the heart of this problem becomes, "why is B0 not present in the cache memory when accessed?" Why is the data that we need not in cache when we expect it to be?.

First, level one cache memory organization is very similar on Pentium, Pentium II, and K6-family processors. The main difference is the cache size. On Pentium II processors, cache memory is organized as four two-dimensional arrays of 128 rows and 32 columns each, while K6 processors have two 256x64 arrays. The smallest element is one byte. We shall assume for this example that we have just one array of 512 rows and 32 columns in either processor. A byte is either present or absent in cache, and if it’s present, it can only be at one address. To know where, we need the address of the byte in the memory space. The five least significant bits indicate the row, while the next nine bits give us the line of the byte, as shown in Figure 6. It’s important to understand this addressing scheme.

Figure 6

 

Let’s see what happens when the texture map is accessed in the "bad" way. Assume the map is stored at the address 0x08800000. When accessing A0, the texels A0 to P0 are stored in the line 0,at 0x008800000. A1 is present at the address 0x08800200 (256 texels = 512 bytes), which means that texels A1 to P1 are located in the row 16. Hey – what about the rows 1, 2, ...,15? The answer is that they stay unused. So A2 to P2 goes in row 32, and so on, until we reach texels A31 to P31 which go into row 496. Once we reach that point, though, trouble begins: the data in A32-P32, as you might have guessed, gets placed in row 0, evicting the data that was already there (A0 to P0). Data is now being overwritten beginning with texel A32. This is why texel B0 is absent from cache when we try to access it later.

After investigating the caching mechanism, we are armed with enough knowledge to start looking for a solution. We have three options.

  1. We could store the texture map as a double specimen, one rotated 90 degrees from the other. This is a somewhat inelegant approach, however.
  2. We could split the two big triangles, shown in Figure 7a, into smaller ones as shown in Figure 7b. This is, perhaps, the easiest solution to the problem, but it’s not the most efficient: the addition of all of these smaller triangles will slow the frame rate a bit.
  3. The last approach is the one I prefer. The idea is to break up the pattern. This is achieved by using a texture map line size that isn’t a power of two. Adding 16 void texels (16 bits each) to the end of each line is sufficient on the Pentium II processor, but not enough on the K6, where 32 void texels is a better choice (see Figures 8a and 8b).

Cache Makes a Difference

Cache memory is a powerful tool. In order unleash its potential, however, it needs a little bit of help. There’s no need to use assembly language, even if it offers more flexibility than C/C++. In the last example we didn’t talk about the code at all, just about the data structures, and that’s our main point: always be sure that your data organization is cache friendly.

Alexandre Macris is currently working in the R&D department at Cryo Interactive as 3D Engine code optimizer for K6-2, Pentium II and Pentium III processors. He's waiting your feed back at a.macris@cryo-interactive.com. Pascal Urro has a Ph.D. in physics and chemistry and is a videogame addict. He has worked on video compression and 3D engines. He is now the R&D manager at Cryo Interactive. You can reach him at p.urro@cryo-interactive.com. Both authors would like to thank Joshua Thayer for his input while writing this article.


[Back to] ntroduction


join | contact us | advertise | write | my profile
news | features | companies | jobs | resumes | education | product guide | projects | store



Copyright © 2003 CMP Media LLC

privacy policy
| terms of service