Direct X, XNA, etc: March 2009

Thursday, March 26, 2009

The Beauty of Destruction (Pete Isensee, Microsoft)

C++ Destructor Definition

One
Special
Deterministic -> called at well defined times
Automatic -> object out of scope or delete
Symmetric -> constructor fits
Member -> part of a class
Function
With
- A special name (~)
- No parameters
- No return type
Designed to
- Give last rites
- Before object death

C# uses finalizers instead: they are called by the GC, so they are non-deterministic. Same in Java.

When destructors are called

Global or static objects: called when the program terminates
Arrays: elements destructed in reverse order
STL containers: elements destructed in unspecified order
delete operator
out of scope
temp objects
exception thrown (stack unwinding)
explicitly
exit()
abort (does not call destructor)

Order of destruction

Rule of thumb: Reverse order of construction
Specifically
- Destructor body
- Data members in reverse order of declaration
- Direct non-virtual base classes in reverse order
- Virtual base classes in reverse order
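
The ordering rules above can be checked with a small logging sketch. The `Tracer`, `Base` and `Derived` types and the `g_log` string are hypothetical names made up for this illustration; they are not from the talk.

```cpp
#include <cassert>
#include <string>

// Demo-only log: every destructor appends one character.
static std::string g_log;

struct Tracer {
    char tag;
    explicit Tracer(char t) : tag(t) {}
    ~Tracer() { g_log += tag; }
};

struct Base {
    Tracer baseMember{'b'};
    ~Base() { g_log += 'B'; }
};

struct Derived : Base {
    Tracer first{'1'};
    Tracer second{'2'};
    ~Derived() { g_log += 'D'; }
};

// Destroys one Derived object and returns the order the destructors ran in.
std::string destructionOrder() {
    g_log.clear();
    {
        Derived d;  // destroyed at the closing brace
    }
    assert(g_log.size() == 5);  // five destructors ran in total
    return g_log;
}
```

Reading the result left to right: the `Derived` body runs first, then its members in reverse declaration order, then the `Base` body, then the `Base` member.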

Implicit Destructors

not specified by programmer
inline by default
public
recommended for struct-like POD-only objects
for everything else, avoid implicit destructors
- better debugging
- improved perf analysis

Trivial Destructors

Implicit
Not virtual
All direct base classes have trivial dtors
All non-static members have trivial dtors
Destructors that never do anything

Virtual Destructors

Guarantee that derived classes get cleaned up
Rule of thumb: if class has virtual functions, dtor should be virtual
- i.e. if delete is ever called on a Base* that could point to a Derived
Perf: Obj with any virtual funcs includes a vtable ptr
Idiom exceptions: mixin classes
A pure virtual dtor signals an abstract class (virtual ~T() = 0; it still needs a body, and the in-class = 0 {} form is a compiler extension)
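
A minimal sketch of the rule of thumb, using hypothetical `Shape`, `Triangle` and `Interface` types invented for this example. It shows that deleting through a base pointer reaches the derived destructor only because the base destructor is virtual, and how a pure virtual destructor is declared and defined in standard C++.

```cpp
#include <cassert>

static int g_derivedCleanups = 0;  // demo-only counter

struct Shape {
    virtual ~Shape() {}             // virtual: delete through Shape* is safe
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    ~Triangle() override { ++g_derivedCleanups; }
    int sides() const override { return 3; }
};

// Pure virtual dtor: marks the class abstract, but it must still be defined
// because every derived destructor calls it.
struct Interface {
    virtual ~Interface() = 0;
};
inline Interface::~Interface() {}

int deleteThroughBase() {
    g_derivedCleanups = 0;
    Shape* s = new Triangle;
    assert(s->sides() == 3);
    delete s;                       // runs ~Triangle, then ~Shape
    return g_derivedCleanups;
}
```

If `~Shape` were not virtual, the `delete s` line would be undefined behaviour and `~Triangle` would typically be skipped.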

Partial Construction & Destruction

Dtors are only called for fully constructed objects
if a ctor throws, obj was not fully constructed
- obj dtor will not be called
- but fully constructed subobjects will be destroyed
Always use RAII with ctors
- Resource Acquisition Is Initialization
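
A small sketch of what happens when a constructor throws. `Resource`, `Widget` and the two globals are hypothetical names used only to observe which destructors run.

```cpp
#include <cassert>
#include <stdexcept>

static int  g_liveResources = 0;     // demo-only bookkeeping
static bool g_widgetDtorRan = false;

struct Resource {
    Resource()  { ++g_liveResources; }
    ~Resource() { --g_liveResources; }
};

struct Widget {
    Resource first;  // fully constructed before the ctor body runs
    Widget() { throw std::runtime_error("ctor failed"); }
    ~Widget() { g_widgetDtorRan = true; }
};

// Returns true if the constructor's exception reached the caller.
bool constructAndFail() {
    try {
        Widget w;                          // throws from the ctor body
        assert(false && "not reached");
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

After the call, `g_liveResources` is back to zero because the `Resource` member was destroyed, while `g_widgetDtorRan` stays false because the `Widget` itself never finished constructing. This is why raw resources acquired in a ctor body leak on failure unless they are wrapped in RAII members.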

Virtual Functions in Destructors

Virtual functions are not virtual inside dtors
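
To make that concrete, here is a sketch with hypothetical `Base` and `Derived` types: by the time `~Base` runs, the `Derived` part is already gone, so a virtual call resolves to the `Base` version.

```cpp
#include <cassert>
#include <string>

static std::string g_calls;  // demo-only log of which name() was reached

struct Base {
    virtual ~Base() { g_calls += name(); }   // resolves to Base::name here
    virtual const char* name() const { return "Base"; }
};

struct Derived : Base {
    ~Derived() override {
        g_calls += name();                   // still Derived::name here
        g_calls += ",";
    }
    const char* name() const override { return "Derived"; }
};

std::string namesSeenDuringDestruction() {
    g_calls.clear();
    {
        Derived d;
        // Normal virtual dispatch while the object is alive.
        assert(std::string(d.name()) == "Derived");
    }
    return g_calls;
}
```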

C++ Exception Handling

Destructors : Exceptions :: Spock : Kirk
Wrap any function that acquires a resource in a class where dtor releases the resource
Never allow an exception to exit a dtor
- Best: don’t throw in dtor
- OK: wrap throwing code in a try/catch
Good advice even if you don’t use C++EH
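
A sketch of both pieces of advice together. `acquire`, `release`, `Guard` and `g_open` are hypothetical stand-ins for whatever resource API is being wrapped; the point is only the shape: acquire in the ctor, release in the dtor, and never let the dtor throw.

```cpp
#include <cassert>
#include <stdexcept>

static int g_open = 0;  // demo-only count of held resources

void acquire() { ++g_open; }
void release() {
    if (g_open <= 0) throw std::logic_error("double release");
    --g_open;
}

class Guard {
public:
    Guard() { acquire(); }
    ~Guard() {
        try {
            release();
        } catch (...) {
            // Swallow: an exception must never escape a dtor.
        }
    }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;
};

// Returns true if the resource was released during stack unwinding.
bool workThatThrows() {
    try {
        Guard g;
        assert(g_open == 1);
        throw std::runtime_error("boom");   // unwinding runs ~Guard
    } catch (const std::runtime_error&) {
        return g_open == 0;
    }
    return false;
}
```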

Multithreading

You are responsible for protecting objects and their contents
Sharing an object across threads
- Use shared_ptr
- or some other reference counting
- or otherwise ensure only one thread can destroy
Protect shared memory (global counters, ref counts) in dtor

delete and Destructors

delete p is a two-step process
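
The notes do not list the two steps, but in standard C++ they are: run the destructor, then call `operator delete` to release the memory. A sketch that logs both, using a hypothetical `Tracked` class with its own `operator new` and `operator delete`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

static std::string g_steps;  // demo-only log

struct Tracked {
    ~Tracked() { g_steps += "dtor;"; }

    static void* operator new(std::size_t n) {
        void* p = std::malloc(n);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept {
        g_steps += "free;";
        std::free(p);
    }
};

std::string deleteSteps() {
    g_steps.clear();
    Tracked* p = new Tracked;
    assert(g_steps.empty());   // nothing logged by construction
    delete p;                  // step 1: p->~Tracked(); step 2: operator delete(p)
    return g_steps;
}
```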

Explicit Destructors

Destructors can be called directly
Avoid 99.9% of the time
Very powerful for custom memory scenarios
Examples
- with placement new
- STL allocators
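
A sketch of the placement new case, with a hypothetical `Particle` type. The object lives in a caller-provided buffer, so `delete` would be wrong; the destructor is called directly instead.

```cpp
#include <cassert>
#include <new>

static int g_alive = 0;  // demo-only count of live Particle objects

struct Particle {
    float x;
    explicit Particle(float px) : x(px) { ++g_alive; }
    ~Particle() { --g_alive; }
};

int placementRoundTrip() {
    // Raw storage, e.g. a slot in a custom pool; here just a stack buffer.
    alignas(Particle) unsigned char storage[sizeof(Particle)];

    Particle* p = new (storage) Particle(1.5f);  // construct in place
    assert(g_alive == 1);

    p->~Particle();   // explicit dtor call; the storage itself is not freed
    return g_alive;
}
```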

std::allocator

Allocators enable custom STL container memory
Two key destructive functions
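
The notes do not name the two functions; presumably they are `destroy` (run the destructor) and `deallocate` (release the memory). A sketch of the full allocator life cycle follows. It goes through `std::allocator_traits`, the C++11 spelling, which is an assumption on my part: in 2009 the calls would have been made directly on the allocator.

```cpp
#include <cassert>
#include <memory>
#include <string>

int allocatorRoundTrip() {
    using Alloc  = std::allocator<std::string>;
    using Traits = std::allocator_traits<Alloc>;

    Alloc alloc;
    std::string* p = Traits::allocate(alloc, 1);  // raw memory, no object yet
    Traits::construct(alloc, p, "hello");         // object now exists
    assert(*p == "hello");
    int length = static_cast<int>(p->size());

    Traits::destroy(alloc, p);                    // runs ~basic_string
    Traits::deallocate(alloc, p, 1);              // returns the memory
    return length;
}
```

Containers keep these two steps separate so they can destroy elements without giving memory back, as `vector::clear` does.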

shared_ptr

Templated, non-intrusive, deterministically reference-counted smart pointer

shared_ptr deleters

Deleter : a functor called on the stored raw pointer when ref count hits zero
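
A sketch of a custom deleter, using `std::shared_ptr` from C++11 (at the time of the talk this would have been the Boost or TR1 version). `Texture` and `TextureDeleter` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

static int g_released = 0;  // demo-only count of deleter invocations

struct Texture { int id; };

// Functor invoked on the stored raw pointer when the ref count hits zero.
struct TextureDeleter {
    void operator()(Texture* t) const {
        ++g_released;
        delete t;
    }
};

int sharedDeleterDemo() {
    g_released = 0;
    {
        std::shared_ptr<Texture> a(new Texture{7}, TextureDeleter());
        std::shared_ptr<Texture> b = a;
        assert(a.use_count() == 2);
        a.reset();
        assert(g_released == 0);   // b still owns the texture
    }                              // b goes out of scope: deleter runs once
    return g_released;
}
```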

Performance

Destructors are called a LOT
they are invisible in code
streamline common dtors
the best dtor is empty
inlining
profile

The Rendering Technology of KillZone 2 (Michal Valient)

How we made Killzone 2 run @ 30FPS

Deferred shading
Diet for render targets
Dirty lighting tricks
Rendering, memory and SPUs

Deferred shading

not forward rendering
Geometry pass – fill the GBuffer (all material info for lighting)
loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
Lighting pass – accumulate info (only light, no textures)

GBuffer

RGBA FP16 buffers proved to be too much
Moved to RGBA8
- 4xRGBA8 + D24S8 – 18.4 MB
- 2xMSAA (Quincunx) – 36.8 MB
Memory reused by later rendering stages
- Low res pass, post processing, HUD
View space position computed from depth buffer
Normal.z = sqrt(1 - Normal.x^2 - Normal.y^2)
- No neg z, but does not cause problems
- 2xFP16 compressed to RGBA8 on write
Motion vectors – screen space
Albedo – material diffuse color
Roughness – specular exponent in log range
Specular intensity – single channel only
Sun Shadow – pre-rendered sun shadows (offline light map)
- Mixed with real-time sun shadows
Lighting accumulation buffer (LAB)
- Geometry pass fills in indirect lighting terms
  - Stored in lightmaps and IBLs
  - Adds ambient color, scene reflections
- Lighting pass adds contribution of each light
Glow – contains HDR luminance of LAB
- Used to reconstruct HDR RGB for bloom

Lighting pass

Most expensive pass
- 100+ dynamic lights per frame
- 10+ shadow casting lights per frame
- AA means more of everything
Optimization
- Avoid hard work
- Work less for MSAA
- Precompute sun shadow offline
- Approximate

Avoid hard work

Don’t run shaders
- Use early z/stencil cull unit
- Depth bounds test is the new cool
- Enable conditional rendering
Optimized light shaders
- For each combination of light features
Fade out shadows for small lights
Remove small objects from shadow map

Lighting pass and MSAA

MSAA facts
- Each sample has to be lit
- Samples of non-edge pixel are equal
KZ2 solution – in shader supersampling
- Run at 1280x720 not 2560x720
- Light two samples in one go

Shadow map filtering distribution

Motivation
- Define filtering quality per pixel rather than per sample.
Split filter coordinates into disjoint sets
- One set per pixel sample
MSAA is almost as fast as non-MSAA

Sunlight

Fullscreen directional light
We divide screen into depth slices
Each depth slice is lit separately
- Different shadow properties
- Used depth bounds test
Use sun shadow from GBuffer
- Stencil mark pixels completely in shadow
  - Skip expensive sunlight shader
- Also mixed with real-time shadows

Sunlight rendering – Fake MSAA

Used only on distant pixels
Cut down lighting cost
- Run lighting equation on closest sample only
Is this wrong?
- It's a hack
- Works correctly against background
- The edges are still partially anti-aliased
- Distant scenery is heavily post processed

Sunlight – shadow map rendering

Generate shadow map for each depth slice
Common approach
- Align shadow map to view direction
- Pros – max shadow map usage
- Result – shadow map shimmering
Fix

GPU driven memory allocation system

Push Buffer building

Multiple SPUs building PB in parallel
Additional SPUs generating data
- Skinning, particles – VB
- IBL interpolation – textures
Common solutions
- Ring Buffering
  - Issue with out of order allocations
- Double Buffering
  - Too much memory

KZ2 render memory allocator

Fixed mem pool
- 22MB block – split into 256k blocks
Each block has associated AllocationID
- Specified by client during allocation
- Only whole block can be allocated
Global FreeID identifies free blocks
- Updated as RSX consumes ‘Free’ marker
Lockless, out of order, memory allocation
- From PPU and/or SPU
- Simple table walk (fast!)
Allows immediate memory reuse
- We generate push-buffer just in time for RSX
- Block can be reused right after RSX consumption
Can allocate memory for skinning early…
- and still free at correct point in frame
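
A rough sketch of how such a table-walk allocator could look. This is my reading of the bullets above, not the shipped Killzone 2 code. The assumptions are loud ones: a block counts as free when its AllocationID is less than or equal to the global FreeID, IDs only grow, wrap-around is ignored, and `std::atomic` stands in for the platform's own atomics. `BlockPool` and its members are hypothetical names, and only the ID table is modelled, not the memory itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

class BlockPool {
public:
    static constexpr std::size_t kBlockSize  = 256 * 1024;
    static constexpr std::size_t kBlockCount = 88;          // 22 MB / 256 KB
    static constexpr std::size_t kNoBlock    = kBlockCount; // "allocation failed"

    BlockPool() : freeId_(0) {
        for (std::size_t i = 0; i < kBlockCount; ++i) ids_[i].store(0);
    }

    // Walk the table and claim the first block whose ID has been retired.
    // Returns the block index, or kNoBlock if every block is still in use.
    std::size_t allocate(std::uint32_t allocationId) {
        const std::uint32_t freeId = freeId_.load(std::memory_order_acquire);
        assert(allocationId > freeId);  // new tags must be newer than the watermark
        for (std::size_t i = 0; i < kBlockCount; ++i) {
            std::uint32_t seen = ids_[i].load(std::memory_order_relaxed);
            if (seen <= freeId &&
                ids_[i].compare_exchange_strong(seen, allocationId,
                                                std::memory_order_acq_rel)) {
                return i;               // we won the race for this block
            }
        }
        return kNoBlock;
    }

    // Called when the consumer (the RSX in the talk) passes a 'Free' marker.
    void retireUpTo(std::uint32_t id) {
        freeId_.store(id, std::memory_order_release);
    }

private:
    std::atomic<std::uint32_t> ids_[kBlockCount];
    std::atomic<std::uint32_t> freeId_;
};

static_assert(22u * 1024u * 1024u / BlockPool::kBlockSize == BlockPool::kBlockCount,
              "22 MB split into 256 KB blocks");
```

Freeing is a single store by whoever observes the consumer's progress, which is what lets a block be reused immediately after consumption without any lock.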

Direct3D 11 Tessellation Deep Dive (presented by Matt Lee)

High fidelity characters seem a bit out of reach of real-time apps (games).

10 to 30K chars not out of reach for 360/PS3

Striving for Cinematic Quality Characters

think in terms of triangles currently

Catmull-Clark subdivision surfaces

Industry standard subdivision surface scheme

Modern implementations don’t require too smooth

Direct3D11

Realtime rendering of Catmull-Clark
3 new pipeline stages
Hardware design removes bandwidth bottlenecks from current implementations
Better use of multi-core processors and improved shader management

(dynamic shader linking)

Direct3D11 Pipeline

Hull Shader
Tessellator
Domain Shader

Tessellation Data Flow

Hull shader – executed per patch
Tessellator – executed per patch (fixed-function), generates triangles
Domain shader – executed per tessellated vertex

Hull Shader

patch control points (input)
output to Domain Shader
two phases per patch (control points, patch constant)
patch constant output to tessellator (modifies behavior of tessellator)
control points go to domain shader

Tessellator

state from D3D API
input from Patch constant phase
generates tessellated triangles
out to later stages

Domain Shader

Hull Shader and Tessellator output
Smooth surface evaluated
one vertex
Control points are in GPU (saves bandwidth)
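
To give a feel for what the domain shader computes per vertex, here is a CPU-side sketch in C++ rather than HLSL. It assumes the patch is a bicubic Bezier patch with 16 control points, which is my assumption about the patch type; `Vec3`, `bernstein` and `evaluatePatch` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Cubic Bernstein basis weights at parameter t in [0, 1].
void bernstein(float t, float w[4]) {
    const float s = 1.0f - t;
    w[0] = s * s * s;
    w[1] = 3.0f * s * s * t;
    w[2] = 3.0f * s * t * t;
    w[3] = t * t * t;
}

// Evaluate one surface point from a 4x4 grid of control points (row-major).
// (u, v) is the domain location handed over by the tessellator.
Vec3 evaluatePatch(const Vec3 controlPoints[16], float u, float v) {
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);
    float bu[4], bv[4];
    bernstein(u, bu);
    bernstein(v, bv);

    Vec3 p = {0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            const float w = bv[row] * bu[col];
            const Vec3& c = controlPoints[row * 4 + col];
            p.x += w * c.x;
            p.y += w * c.y;
            p.z += w * c.z;
        }
    }
    return p;
}
```

The function runs once per generated vertex and only reads the 16 control points, which is why keeping them on the GPU saves bandwidth.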

Where to use

LOD of terrain
Bezier patches from higher-order surfaces (Catmull-Clark)

Catmull-Clark

Baked into content early
Goal is real time
Disadvantage (have to build offline and huge memory issues)

Loop-Schaefer approximation (D3D11), others exist

Benefits

Content creation easier
Save memory
Easier LOD (doesn’t require separate meshes) so no use for MIP maps?

Pipeline

offline Load control mesh
offline compute adjacency for each quad
offline compute texture tangent space for each vertex
rt Morph & skin the quad mesh in the VS
rt convert quad mesh into patches in hull shader
rt Evaluate patches using domain shader
rt apply displacement map

tangent patches (fix up surface normals), extraordinary vertices (valence < 4 or > 4)

Available in March 2009 DX SDK (today), next release in June 09

SubD11 sample

Optimization will be performed when hardware is finalized

Current shader design is not expected to perform well on hardware.

Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

no instruction reordering
no store forward hardware
smaller caches, slower memory
no l3 cache

LHS
L2 Miss
Expensive, non pipelined instructions
Branch mispredict penalty

Load Hit Store

Store to memory location, then load, flush the L2 cache
Casts, changing register set, aliasing
Passing by value, or by reference
On PC, instruction reordering and store-forwarding hardware hide most of this cost

L2 Miss

Loading from location, checks cache
Cost ~610 cycles to load cache line
Hot cold split
Reduce in-memory data size
Use cache coherent structures
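
A sketch of the hot/cold split idea. The entity layout is invented for illustration, and the sizes assume 4-byte floats: fields touched every frame go in one tightly packed array, rarely used fields go in a parallel array so they never share cache lines with the hot data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hot: read and written every frame.
struct EntityHot {
    float position[3];
    float velocity[3];
};

// Cold: only touched by tools, scripts or debugging.
struct EntityCold {
    char debugName[64];
    char scriptState[128];
};

// Parallel arrays: index i in 'hot' and 'cold' is the same entity.
struct EntityStore {
    std::vector<EntityHot>  hot;
    std::vector<EntityCold> cold;
};

constexpr std::size_t kCacheLineBytes = 128;  // line size quoted in these notes
static_assert(sizeof(EntityHot) <= kCacheLineBytes / 5,
              "at least five hot records fit in one cache line");

// The per-frame loop only walks the hot array.
void integrate(EntityStore& store, float dt) {
    assert(store.hot.size() == store.cold.size());
    for (EntityHot& e : store.hot) {
        for (int axis = 0; axis < 3; ++axis) {
            e.position[axis] += e.velocity[axis] * dt;
        }
    }
}
```

Had the 192 cold bytes stayed inline, each entity would span more than one 128-byte line and the loop would pull in mostly unused data.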

Expensive Instructions

non pipelined instructions
Stalls hardware threads

Branch Mispredict

Mispredicting branch
23-24 cycle delay
Know how the compiler implements branches
Reduce total branch count for task
Refactor calculations to remove branches
Unroll
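
Two small examples of refactoring a calculation to remove a branch. These are generic integer and counting tricks, not code from the talk, and the function names are mine. A modern compiler may already emit a select for the plain ternary, so profile before and after.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branch-free max: turn the comparison into an all-ones or all-zeros mask,
// then blend the two inputs with it.
std::int32_t selectMax(std::int32_t a, std::int32_t b) {
    const std::int32_t mask = -static_cast<std::int32_t>(a > b);  // -1 or 0
    return (a & mask) | (b & ~mask);
}

// Branch-free counting: add the comparison result instead of testing it.
std::size_t countAbove(const float* values, std::size_t count, float threshold) {
    assert(values != nullptr || count == 0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        hits += static_cast<std::size_t>(values[i] > threshold);
    }
    return hits;
}
```

On the 360 the floating-point equivalent is the `fsel` instruction mentioned later in these notes.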

Profiling!!!

360 Tools

PIX Cpu instruction trace
LibPMCPB counters
XbPerfView sampling capture

Other Platforms

SN Tuner, vTune

Think laterally

Inline functions
pass and return in register (_declspec(passinreg))
_restrict (compiler released from being ultra careful)
const

Compiler options

Inline
Prefer speed over size
Fast floating point over precise
360 (/Ou removing div by zero, /Oc runs a second code scheduling pass)
Reduce parameter counts
Prefer 32, 64, 128 bit parameters
Isolate constants
Avoid virtual if feasible

Know your cache architecture

Cross core sharing policy (L2 shared, L1 single)
Prefetch mech (dcbt, dcbz128)
L2 1 MB, L1 32 KB
Cache line 128 bytes

Know your instruction set

360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
PS3 (altivec)
PC (SSE2-4.1 and friends)

What went wrong

Correctness
Guessed at 1 perf issue
SIMD vs straight float
Memory access and L2 usage unchanged
Branch behavior exactly the same

Image Analysis

Gaussian Mixture Model
Profiling showed 86% of time in the pixel cost function

The PlayStation 3’s SPUs in the Real World (presented by Michiel van der Leeuw)

Things they did on the SPUs (post-mortem)
What worked and didn’t work
Practical advice
Food for thought

3 Years ~120 team size with 27 programmers

Cinematic
Dense
Realistic
Intense

6 x 3.2 GHz processors
Local mem per SPU
Very fast DMA

Core Requirements

Animation
AI
Skinning
Physics
Compression/Decompression
etc

Graphics

Light probe sampling
~2500 static light probes per level
9x3 Spherical Harmonics in KD-tree
sample light, blend 4 closest light probes, rotate in view space
bake lights into level

Particle simulation

250 particle systems per frame
150 drawn
3000 particles updated
200 collision ray casts
System grown over time

Refactored

Vertex generation
Particle simulation inner loop
Initialization & deletion of particles
High-level management / glue

Not done on SPU

Updated global scene graph
Starting & stopping sounds

Image Post Processing

Effects done on SPU

Motion blur
Depth of field
Bloom

SPUs assist the RSX with post-processing

RSX prepares low-res image buffers
RSX triggers interrupt to start SPUs
SPUs perform image operations
RSX already starts next frame
Result in SPU processed by RSX early in next frame
Similar to PhyreEngine now

SPUs are compute-bound

Bandwidth no issue
Code can be optimized

Our trade-off: RSX vs SPU time

SPUs take longer
SPUs look better
RSX was the bottleneck

Bloom and Lens Reflection

13% on one SPU
Depth-dependent intensity response curve
7x7 Gaussian blur
Upscaling results from different levels
Internal lens reflection
Result buffer

Waypoint cover maps –> depth map

IBL Sampling

SPU cost a lot of dev time

Code is future proof, scales to more cores, supports the items they require.

The future is memory-local and excessively parallel

SPUs are just one of these ‘new architectures’

Optimize for the concept

Keep code portable

Parallelization of code takes time

Treat CPU as cluster

Think in workloads / jobs

Build latency in algorithms

Don’t optimize too early

Lockless Programming in Games (Bruce Dawson, Microsoft)

Current Hardware

360 – 6 hardware threads
PS3 – 9 hardware threads
Windows – Quad cores not uncommon
Point being multi-core is here to stay

Multithreading is mandatory if you want to harness the available power. If not you are really wasting the advanced features of the hardware.

Multithreaded programming is easy if you don’t share data. :) Of course this is not usually an option.

Best way to share data between threads is by using locks. This is important. Lockless is not a one-size fits all approach.

Lockless programming typically involves a job queue, for example one built on an STL queue. The problem is that STL queues are not thread safe, so we have to make them safe. :)

Solution, use critical section to block off the code.
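
A sketch of that locked baseline. `std::mutex` from C++11 stands in for a Win32 critical section here, which is a substitution on my part, and `LockedJobQueue` is a hypothetical name.

```cpp
#include <cassert>
#include <mutex>
#include <queue>

template <typename Job>
class LockedJobQueue {
public:
    void push(const Job& job) {
        std::lock_guard<std::mutex> lock(mutex_);  // released at end of scope
        jobs_.push(job);
    }

    // Non-blocking pop: returns false when the queue is empty.
    bool tryPop(Job& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (jobs_.empty()) return false;
        out = jobs_.front();
        jobs_.pop();
        return true;
    }

private:
    std::mutex      mutex_;
    std::queue<Job> jobs_;
};
```

The empty check and the pop sit under the same lock on purpose. Checking `empty()` in one call and popping in another would leave a window for a second consumer to drain the queue in between.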

Bad things

Acquiring and releasing locks takes time
Deadlocks
Contention – waiting, holding locks too long
Priority inversions – system threads on 360 do this (too often)

Use locks carefully or lockless

Safely share data without locks (no deadlocks or priority inversion)
Cons

sList (singly linked list) InterlockedPushEntrySList

This is NOT a queue! This is a stack! Don’t use on 360!!!!!

One writer, one reader (singleton) (works on paper, not in real world)

Read data (cpu to L2), write (cpu to L2)

Writes can happen before getting put in L2 cache

Happens on reads too (second read could come from L1)

read and write can pass each other

On PowerPC, reads and writes can pass each other, but on x86 only a load can pass a store

Reads not passing writes would basically disable L1, huge perf hit

publisher / subscriber model

ExportBarrier – no passing sign (stop sign) HANDLE BOTH reads and writes

Compilers are just as evil, rearrange code (single threaded)

Compiler/CPU reordering barriers needed

_ReadWriteBarrier(); x86

__lwsync(); PowerPC (both CPU and compiler)

Positioning is crucial (barrier between writes)

write-release semantics is the name

read-acquire semantics is the name

reader needs both read / write
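
A sketch of the publish/subscribe pattern with write-release and read-acquire. It uses `std::atomic` from C++11 instead of the platform barrier intrinsics named above, which is a substitution on my part. `Mailbox`, `publish` and `tryConsume` are hypothetical names, and the mailbox is single-shot.

```cpp
#include <atomic>
#include <cassert>

struct Mailbox {
    int payload = 0;                   // plain data being published
    std::atomic<bool> ready{false};    // flag that guards the payload
};

// Publisher: write the data first, then set the flag with release semantics
// so the payload write cannot be reordered after the flag write.
void publish(Mailbox& box, int value) {
    assert(!box.ready.load(std::memory_order_relaxed));  // publish only once
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Subscriber: read the flag with acquire semantics so the payload read
// cannot be reordered before the flag read.
bool tryConsume(Mailbox& box, int& out) {
    if (!box.ready.load(std::memory_order_acquire)) return false;
    out = box.payload;
    return true;
}
```

The barrier position matters exactly as the notes say: the release sits between the data write and the flag write, the acquire between the flag read and the data read.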

Dekker’s / Peterson’s Algorithm

MemoryBarrier

x86 _asm xchg Barrier, eax
x64 __faststorefence()
PowerPC __sync();

what about volatile

standard volatile…..NO

doesn’t prevent CPU reordering and all variables would need to be tagged volatile

VC++ is better, but doesn’t prevent hardware reordering on 360

Acts as read-acquire / write-release on x86/x64 and Itanium

atomic<T> in C++0x

Double checked locking – singleton
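
A sketch of double-checked locking written against `std::atomic` and `std::mutex` as they were later standardised in C++11. `Settings` is a hypothetical singleton, and the instance is deliberately never freed to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Settings {
public:
    static Settings* instance() {
        // First check, no lock: acquire pairs with the release store below.
        Settings* p = instance_.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(mutex_);
            // Second check, under the lock: another thread may have won.
            p = instance_.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Settings;
                // Publish only after construction has finished.
                instance_.store(p, std::memory_order_release);
            }
        }
        assert(p != nullptr);
        return p;
    }

    int volume = 5;

private:
    Settings() {}
    static std::atomic<Settings*> instance_;
    static std::mutex             mutex_;
};

std::atomic<Settings*> Settings::instance_{nullptr};
std::mutex             Settings::mutex_;
```

With a plain pointer instead of the atomic, a second thread could see a non-null pointer before the object's fields are visible, which is the reordering problem described above.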

InterlockedXxx

doesn’t work on 360

it’s a full barrier on x86/x64/Itanium

InterlockedXxx Acquire/Release are portable (preferred)

Uses

Reference counts
Setting a flag
Publish/Subscribe
SLists
XMCore on 360
Double checked locking

Export, import, full barriers

Prefer to use locks!!!!

use lockless when locks are too costly

http://msdn.microsoft.com/en-us/library/bb310595(VS.85).aspx

Keynote: Discovering new development opportunities (Satoru Iwata, Nintendo)

The day starts out with a packed room, all waiting to hear the keynote delivered by Satoru Iwata, President of Nintendo. With the unquestionable success of Nintendo products worldwide, this talk is to pull back the curtain a bit on some of the ideas and methods that have led to this success.

Iwata started off by presenting the obligatory numbers slides showing how much success both the Wii and DS have shown in recent history. No one can take that away from them, they have several successful platforms currently.

Iwata then started to discuss arguably the most important developer at Nintendo, Mr. Miyamoto. He explained that Miyamoto is one of the main reasons for the continued success, and described in some detail how Miyamoto’s development style has achieved it.

Mr. Miyamoto first starts with a core concept, as do most software projects. One of the differences comes from where Miyamoto pulls the new ideas from. He is fascinated with studying humans and their behaviors, specifically when they are doing something that makes them happy. He will draw on this to come up with the concept for the new piece of software that he is attempting to create. For example, he got a dog for his family, and out of this was born Nintendogs.

Of course, having a good idea for a game concept and following it through to the release of a successful title are two very different things. This brings up the next key point: Miyamoto’s software development style. He typically will form a very small team (sometimes even just one developer) and they will work on a prototype, or rather a series of prototypes. At this stage, the graphics are very crude (boxes). They will work on this for however long it takes to perfect the core concept. At this stage, not even the president of the company will ask how things are going, or when this will be ready for the next stage. It should be noted that sometimes at this stage, work will be done but then shelved. This could happen for various reasons, but almost always, at least some of this will be used at a later time.

If the prototyping has met Miyamoto’s satisfaction, only at this stage will others be brought in on the game. This is where the polish (graphics and such) comes in, but the core gameplay is pretty much guaranteed at this point. This avoids the situation where, after time has been spent on polish items, a core gameplay element requires a rewrite. This almost never happens with this style.

Nintendo also has some unique “playtest” elements to the project. They do not conduct formal playtests. Instead, Miyamoto will “kidnap” a non-technical employee and have them play the game with no help. He then checks to see how it works out. If they are able to understand and play with no help, the dev team has done their job. If not, it’s a failure and the game will be readjusted.

Next, Iwata unveiled the new Virtual Console with support for larger SD cards and options to run from SD. Some game demos of future titles were also shown. Rhythm Heaven was introduced as well, and he gave everyone in attendance a free copy before it goes on sale.