|
Features

Multi-Threaded Terrain Smoothing
Multi-Threaded Tessellation with OpenMP*
You can use OpenMP* to add thread-level parallelism to your application by adding special OpenMP* compiler directives to your source code. These directives come as pragmas that do not change the semantics of your initial code and are therefore non-intrusive. As you will see, they are easy to add, and you can use them to parallelize your code incrementally. Of course you need a compiler that supports OpenMP*, but luckily both VS2005* and the Intel C/C++ compiler support it. Source code with OpenMP* pragmas is portable to the Xbox360 because the Microsoft* compiler supports it there as well. Still, this article is not meant to be a primer on OpenMP*; please see [OpenMP] for an in-depth introduction to it.
The main usage model for OpenMP* is fork-and-join multi-threading, which means that a set of threads forks from the main execution flow and works together on a shared set of tasks. After finishing their work, they all join again. Since OpenMP* uses an internal thread pool, there is no thread creation or cleanup overhead.
For the purpose of the multi-threaded tessellation, a data-parallel OpenMP* pragma is used to parallelize a for()-loop. As indicated by Code Fragment 2, OpenMP* is told to work on N tessellation workloads in parallel. The OpenMP* runtime decides how many threads it uses to do this work. By default this is the number of hardware threads supported by your machine. It is possible, though, to change this through OpenMP* library calls.
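A data-parallel loop of this kind could look roughly like the following sketch. It is not the demo's code: the names Workload, tessellateWorkload and tessellateAll are hypothetical, and the loop body is only a placeholder for the real SSE tessellation. The only OpenMP* pieces used are the parallel for pragma and the library call omp_set_num_threads(), which is guarded so that the code still builds serially when OpenMP* is disabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>   // only needed for the optional omp_set_num_threads() call
#endif

// Hypothetical stand-in for one tessellation workload (e.g. a batch of patches).
struct Workload {
    std::vector<float> heights;  // input samples
    float result = 0.0f;         // placeholder output
};

// Placeholder for the real tessellation of one workload.
// Each call touches only its own workload, so loop iterations are independent.
void tessellateWorkload(Workload& w) {
    float sum = 0.0f;
    for (float h : w.heights) sum += h;
    w.result = sum;
}

// Processes all workloads. With OpenMP enabled, the iterations are distributed
// across the runtime's threads, which join again at the end of the loop.
// numThreads == 0 keeps the runtime's default thread count.
void tessellateAll(std::vector<Workload>& workloads, int numThreads = 0) {
    assert(numThreads >= 0);
#ifdef _OPENMP
    if (numThreads > 0) omp_set_num_threads(numThreads);
#endif
    const int n = static_cast<int>(workloads.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        tessellateWorkload(workloads[i]);
    }
}
```

Because each iteration writes only to its own workload, no locking is needed inside the loop. A slider like the one in the demo would presumably map to the numThreads argument, but that mapping is an assumption here.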
Figure 2 shows what would happen on a machine capable of four hardware threads – please note that the MainThread is also working with the other threads. Also note that the main thread does the culling and workload distribution and also the drawing.
If you start the demo application it starts in a mode that uses OpenMP* to do a data parallel tessellation of the tessellation workload as just described. A slider in the demo can be used to tell OpenMP* to use as many hardware threads as your machine can run in parallel.
Unfortunately, the heavy use of SSE on all threads does not work well with using all logical processors of a hyperthreading system, and will even result in a slowdown. D3D* and the graphics driver, which run on the main thread, also make use of the SSE units. If you wanted to use all logical processors and still gain a speedup, you would have to write additional tessellation code that does not use the SSE units at all. The demo can use affinity masks to try to make sure that only one of the two logical processors of an HT core is used for tessellation (see below). Still, if you ever get hold of a real four-core machine, the demo allows you to use all of its cores.
To prove that the demo can really reach a speedup on a dual core machine do the following:
- Make sure that the device settings indicate that vertical syncing is off
- Set the number of threads to be used to one
- Tick ‘Use OpenMP’
- Increase the viewing distance until you go down to 60 FPS. It is assumed that your graphics card is fast enough to run the initial settings at over 60 FPS.
- Increase the number of threads to be used to two.
- You should see the frame rate go up again, obviously only if you really have a two-core machine. If there is no speedup or almost no speedup, then the tessellation workload is not the limiting factor. Most likely your graphics card is then transform or memory (transfer) limited, which means rendering is relatively expensive. To check this, you can un-tick the ‘Tessellation running’ box. After that you should see how fast your card can draw the vertex load generated by the tessellation.
You will have noticed that the speedup is not necessarily very high. Depending on your system and your graphics card, you can get an increase in frame rate from, for example, 60 to 75 FPS, which would be a speedup of 25%. Again, how much speedup you get is determined by how fast your system can render the tessellated scene. If the rendering cost is small compared to the tessellation cost, the speedup gained with OpenMP* can be higher. One test system I used produced a 50% speedup.
If you bring up the Windows* task manager it becomes apparent that OpenMP* does not use affinity masks to try to lock threads on certain cores or processors. You will see that Windows* reschedules the tessellation threads trying to minimize core utilization. For our purpose this is not too bad but it might be worth trying to bind threads to certain cores.
The reason why one can’t get a higher speedup on certain systems (where rendering is relatively expensive) is that the tessellation work represents a relatively small percentage of the overall per-frame processor load on these systems. Culling, workload distribution and mainly rendering take most of the time. This is not necessarily a problem and can actually be predicted by Amdahl’s Law (see [DevMTApps]). In a nutshell, this law states that the maximum parallel speedup one can reach is limited by the serial portion of the code. Since the rendering is done in just one thread, it limits the speedup. Still, it is possible to reach higher frame rates by decoupling tessellation work from rendering. How this can be done is discussed next.
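To make the Amdahl’s Law argument concrete, here is a small sketch that evaluates the law and inverts it. It assumes an idealized model in which a fraction p of the frame time parallelizes perfectly and the rest (culling, workload distribution, rendering) stays serial; the function names are illustrative only.

```cpp
#include <cassert>

// Amdahl's Law: speedup with n threads when a fraction p (0..1) of the
// single-threaded frame time can be parallelized perfectly.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse of the formula above: the parallel fraction p implied by an
// observed speedup s when going from 1 to n threads.
double parallelFraction(double s, int n) {
    assert(s >= 1.0 && n >= 2);
    return (1.0 - 1.0 / s) * n / (n - 1.0);
}
```

Under this model, going from 60 to 75 FPS with two threads (a speedup of 1.25) implies that tessellation accounts for about 40% of the single-threaded frame time. With that fraction, four threads would give a speedup of about 1.43, and no number of threads could exceed 1 / (1 - 0.4), or roughly 1.67.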
Asynchronous Multi-Threaded Tessellation
To reach a maximum frame rate on systems where the rendering cost is high when compared to tessellation one ideally wants to completely decouple rendering from culling and tessellation. The basic idea is to only pick up a new terrain tessellation when it is done. To cope with camera movements a triangle strip for an enlarged view frustum can be generated. The demo does not do this. You will thus notice that for fast rotations there simply is no terrain available for a short moment in time.
The asynchronous threading architecture that is realized in the demo (activated if you un-tick ‘Use OpenMP’) is shown in Figure 3. For this architecture one needs two vertex buffers that are used alternately. One vertex buffer is rendered by the main thread. The other vertex buffer is asynchronously filled by the tessellation threads. The main thread checks every frame whether a new tessellation is available. If one is available, the main thread draws from the new vertex buffer from then on. It then locks the other (old) vertex buffer and hands it off to the tessellation threads to fill. This is done in a round-robin fashion.
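The buffer hand-off just described can be sketched as follows. This is a simplified illustration, not the demo's implementation: std::vector stands in for the D3D* vertex buffers (so no lock/unlock calls appear), a std::atomic flag replaces the "new tessellation available" notification, and the class and member names are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Vertex { float x, y, z; };

// Two buffers used alternately: the front buffer is drawn by the main thread
// while the back buffer is filled by the tessellation side.
class DoubleBufferedTerrain {
public:
    // Tessellation side: the buffer that may currently be written.
    // Must only be used between a hand-off and the matching publish().
    std::vector<Vertex>& backBuffer() { return buffers_[1 - front_]; }

    // Tessellation side: call once the back buffer is completely filled.
    void publish() {
        assert(!ready_.load());  // publishing twice without a swap is a bug
        ready_.store(true, std::memory_order_release);
    }

    // Main thread, once per frame: if a new tessellation is available, start
    // drawing from it and return true, so the caller can hand the old buffer
    // back to the tessellation side. Otherwise keep drawing the old one.
    bool trySwap() {
        if (!ready_.load(std::memory_order_acquire)) return false;
        ready_.store(false, std::memory_order_relaxed);
        front_ = 1 - front_;
        return true;
    }

    // Main thread: the buffer to draw this frame.
    const std::vector<Vertex>& frontBuffer() const { return buffers_[front_]; }

private:
    std::vector<Vertex> buffers_[2];
    int front_ = 0;
    std::atomic<bool> ready_{false};
};
```

The sketch relies on the protocol from the text: the tessellation side touches the back buffer only after the main thread has handed it off, and the main thread swaps only after publish(). The hand-off signal itself is left out here.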
The synchronization of the threads is handled using Windows* events. One event is used to signal the main tessellation thread that it should start a new tessellation. The main tessellation thread uses another event to signal to the main thread that a new tessellation is available. The main tessellation thread itself first does the culling and the workload distribution. After that, if there are more than two cores in your system, it signals a set of events that kick off additional tessellation threads. The additional tessellation threads work along with the main tessellation thread to finish the tessellation. Each additional tessellation thread signals the main tessellation thread when it has finished its job by setting its own event.
The main tessellation thread does a WaitForMultipleObjects() to wait for all its siblings to finish before signaling the main thread.
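The signalling pattern between the main tessellation thread and its helpers can be sketched portably as shown below. Note the substitutions: instead of Windows* event objects and WaitForMultipleObjects(), the sketch uses a small hypothetical Event class built on std::mutex and std::condition_variable, and it waits on the "done" events one after another, which has the same effect as waiting for all of them. Unlike the demo, which creates its threads once at startup, the sketch creates the helper threads inside the function to stay short.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Portable stand-in for an auto-reset event.
class Event {
public:
    void set() {
        { std::lock_guard<std::mutex> lock(m_); signaled_ = true; }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signaled_; });
        signaled_ = false;  // auto-reset after a successful wait
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// One tessellation round as seen from the main tessellation thread.
// Returns the number of threads that contributed a share of the work.
int runTessellationRound(int helperCount) {
    assert(helperCount >= 0);
    std::vector<Event> start(helperCount), done(helperCount);
    std::vector<int> share(helperCount + 1, 0);
    std::vector<std::thread> helpers;
    for (int i = 0; i < helperCount; ++i) {
        helpers.emplace_back([&, i] {
            start[i].wait();   // sleep until kicked off
            share[i + 1] = 1;  // placeholder for this helper's tessellation work
            done[i].set();     // report completion via its own event
        });
    }
    // Culling and workload distribution would happen here.
    for (auto& e : start) e.set();  // kick off the helpers
    share[0] = 1;                   // the main tessellation thread works too
    for (auto& e : done) e.wait();  // wait for all siblings to finish
    for (auto& t : helpers) t.join();
    int total = 0;
    for (int s : share) total += s;
    return total;
}
```

After the final wait, the main tessellation thread would signal the main thread that a new tessellation is available; that last event is omitted from the sketch.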
Initially, the demo application does not actually run completely asynchronously: the main thread waits until the tessellation kicked off in the previous frame has been finished by the tessellation threads. Interestingly, you will still see frames with an incomplete terrain. The reason for this is that the main tessellation thread has picked up a view cone for culling that is not the same as the one used by the main thread when drawing the actual frame. This can be rectified if we accept a one-frame lag.
You can now switch to fully asynchronous mode if you un-tick ‘Wait for tessellation’. In this case the main thread will only use a new tessellation when it is done.
All threads used by the tessellation are created at the startup of the demo, so no thread creation or cleanup is going on while the demo is running. In addition to that, all threads including the main thread can be affinity-bound to exactly one logical processor of one of the cores by setting the appropriate affinity masks for them. This has been done to enable the use of the Windows* task manager to really see how much processor time is spent for tessellation and on each core – that is if Windows* really respects the affinity masks.
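As an illustration of how such affinity masks might be computed, the sketch below picks the first logical processor of each core. It rests on a loudly stated assumption: that the logical processors of one core have consecutive indices (core 0 owns bits 0..k-1, core 1 owns bits k..2k-1, and so on). Real systems need not be laid out this way, which is why the actual mapping should come from CPU enumeration code. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns an affinity mask with exactly one bit set: the first logical
// processor of the core chosen for this thread.
// ASSUMPTION: logical processors of the same core are numbered consecutively.
// Threads beyond the core count wrap around and share cores.
std::uint64_t affinityMaskForThread(int threadIndex, int coreCount, int logicalPerCore) {
    assert(threadIndex >= 0);
    assert(coreCount > 0 && logicalPerCore > 0);
    assert(coreCount * logicalPerCore <= 64);  // mask has 64 bits
    const int core = threadIndex % coreCount;
    return std::uint64_t(1) << (core * logicalPerCore);
}
```

On Windows*, a mask like this would typically be passed to SetThreadAffinityMask() for the thread in question. On a machine with two HT cores, for example, the function returns the masks 0x1 and 0x4 for the first two threads, so that they never share a core.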
The slider for the number of threads is used differently when un-ticking ‘Use OpenMP’ and running asynchronously. It specifies the number of threads, including the main tessellation thread, to be used for the asynchronous tessellation. On a two core machine it should be left at a value of one.
Compared to the OpenMP* mode, and using the same viewing distance and tessellation settings, you should now see a much higher frame rate if the render cost is high compared to the tessellation cost. If rendering is cheap compared to tessellation, you will see a smaller speedup than with OpenMP*. Depending on your machine, this means you can let the player look even further or increase the quality of the tessellation. If you un-tick ‘Wait for tessellation’, the frame rate you see is independent of the complexity of the tessellation workload. It should be the same as the one you see when un-ticking ‘Tessellation running’.
The Demo
The source code for the demo (see Figure 5) is available for download, so everybody can have a look. The culling code that has been implemented is far from optimal, but you may stick your own culling code into the sample.
The demo pre-computes a grid of patches from a height field in memory. It would be easy to change the code to work on a height field that is synthesized on-the-fly and does not sit in memory at all. It is probably also easy to port the SSE intrinsics to appropriate code for the vector units of the new consoles.
In addition to the tessellation code, you will find the source code for a library that implements CPU detection (written by my colleague Leigh Davies). The CPU detection library enumerates cores and logical processors, which enables you to detect which logical processor belongs to an HT core. Please note that the CPU detection code is supposed to work on all IA32* PC processors, not only on Intel processors.
Conclusion
This article has described how to multi-thread terrain smoothing in a scalable way. The tessellation will be faster with every core you allow the code to use. Initial performance tests indicate that the OpenMP* code path can tessellate and display a terrain with around 20-40 million vertices a second on a dual core processor system. Further, the graphics card that has been used could draw tessellated terrain from dynamic vertex buffers at roughly 70 million vertices a second. This indicates that additional cores can be successfully used to do dynamic terrain tessellation and to generate other dynamic geometry such as procedural plants. Just imagine a forest with trees that all look different. Furthermore, it has been shown that additional cores can be used to offload tasks from the graphics card. The graphics card would otherwise have to do terrain tessellation in addition to what it has to do anyway. It seems as if the new consoles have even more efficient ways to push dynamic geometry to the graphics card (see [Stokes05]), so the approach described in this article could probably be applied very successfully.
References
[DevMTApps] ‘Developing Multithreaded Applications: A Platform Consistent Approach’ online at http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/mrte/53797.htm?page=1
[D3D05] DirectX9c December SDK – Available online from www.microsoft.com
[Farin96] Farin Gerald E., “Curves and Surfaces for Computer-Aided Geometric Design”, Academic Press Inc. (London) Ltd, October 8, 1996
[Foley90] Foley James D., van Dam Andries, Feiner Steven K., Hughes John F., “Computer Graphics”, Addison Wesley 1990
[Gruen05] Gruen Holger, “Efficient Tessellation on the GPU through Instancing”, Journal Of Game Development Volume 1, Issue 3, Thomson Delmar Learning, December 2005
[Bunnell05] Bunnell Michael, “Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping”, GPU Gems II, Addison Wesley 2005
[IntelTools] http://www.intel.com/cd/software/products/asmo-na/eng/threading/219783.htm
[Gabb05] Gabb Henry, “Threading 3D Game Engine Basics” available online at http://www.gamasutra.com/features/20051117/gabb_01.shtml
[OpenMP] www.openmp.org
[Klimovitski01] Klimovitski Alex, “SSE/SSE2 Toolbox Solutions for Real-Life SIMD Problems”, Game Developer Conference 2001, available online at http://www.gamasutra.com/features/gdcarchive/2001E/Alex_Klimovitski3.pdf
[Stokes05] Stokes Jon
‘Inside the Xbox 360, part I: procedural synthesis and dynamic worlds’ available online at http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars
‘Inside the Xbox 360, Part II: the Xenon CPU’ available online at http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
‘Introducing the IBM/Sony/Toshiba Cell Processor — Part I: the SIMD processing units’ available online at http://arstechnica.com/articles/paedia/cpu/cell-1.ars
‘Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture’ available online at http://arstechnica.com/articles/paedia/cpu/cell-2.ars
[West06] West Nick, “The Inner Product: Multi-core Processors”, Game Developer Magazine Volume 13, Number 2, February 2006
*Names and brands may be claimed as the properties of others
|