Thursday, March 26, 2009

The Beauty of Destruction (Pete Isensee, Microsoft)

C++ Destructor Definition

  • One
  • Special
  • Deterministic –> called at well defined times
  • Automatic –> object out of scope or delete
  • Symmetric –> constructor fits
  • Member –> part of a class
  • Function
  • With
    • A special name (~)
    • No parameters
    • No return type
  • Designed to
    • Give last rites
    • Before object death

C# finalizers are different: they are called by the GC, so they are non-deterministic. The same is true in Java.

When destructors are called

  • Global or static objects: called when the program terminates
  • Arrays: elements destroyed in reverse order
  • STL containers: unspecified order
  • delete operator
  • out of scope
  • temp objects
  • exception thrown (stack unwinding)
  • explicitly
  • exit() (destroys static objects, but not automatic ones)
  • abort (does not call destructor)

Order of destruction

  • Rule of thumb: Reverse order of construction
  • Specifically
    • Destructor body
    • Data members in reverse order of declaration
    • Direct non-virtual base classes in reverse order
    • Virtual base classes in reverse order
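
The ordering rules above can be observed directly. The following is a minimal sketch, not code from the talk; the type names and the destructionLog() helper are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log that records the order in which pieces are destroyed.
inline std::vector<std::string>& destructionLog() {
    static std::vector<std::string> log;
    return log;
}

struct Member {
    std::string name;
    explicit Member(std::string n) : name(std::move(n)) { assert(!name.empty()); }
    ~Member() { destructionLog().push_back(name); }
};

struct Base {
    ~Base() { destructionLog().push_back("Base"); }
};

struct Derived : Base {
    Member first{"first"};    // declared first, destroyed last of the members
    Member second{"second"};  // declared second, destroyed first of the members
    ~Derived() { destructionLog().push_back("Derived body"); }
};

// Creates and destroys one Derived, then returns the recorded order.
inline std::vector<std::string> destroyOne() {
    destructionLog().clear();
    { Derived d; }
    return destructionLog();
}
```

The log comes back as destructor body, then members in reverse declaration order, then the base class.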

Implicit Destructors

  • not specified by programmer
  • inline by default
  • public
  • recommended for struct-like POD-only objects
  • for everything else, avoid implicit destructors
    • better debugging
    • improved perf analysis

Trivial Destructors

  • Implicit
  • Not virtual
  • All direct base classes have trivial dtors
  • All non-static members have trivial dtors
  • Destructors that never do anything

Virtual Destructors

  • Guarantee that derived classes get cleaned up
  • Rule of thumb: if class has virtual functions, dtor should be virtual
    • i.e. whenever delete may be called on a Base* that points to a Derived
  • Perf: Obj with any virtual funcs includes a vtable ptr
  • Idiom exceptions: mixin classes
  • A pure virtual dtor signals an abstract class (virtual ~T() = 0; a body must still be defined)
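
A small sketch of the rule of thumb, using hypothetical Shape and Triangle types: because the base destructor is virtual, destroying through a base pointer still runs the derived destructor.

```cpp
#include <cassert>
#include <memory>

struct Shape {
    virtual ~Shape() = default;  // virtual, so deleting via Shape* reaches the derived dtor
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    explicit Triangle(int* counter) : destroyed(counter) { assert(counter != nullptr); }
    ~Triangle() override { ++*destroyed; }
    int sides() const override { return 3; }
    int* destroyed;  // incremented when the Triangle destructor runs
};

// Owns a Triangle through a base-class pointer and reports how many
// Triangle destructors ran when that pointer went out of scope.
inline int destroyThroughBase() {
    int destroyed = 0;
    {
        std::unique_ptr<Shape> s(new Triangle(&destroyed));
    }
    return destroyed;
}
```

If ~Shape() were not virtual, deleting through the Shape pointer would be undefined behavior.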

Partial Construction & Destruction

  • Dtors are only called for fully constructed objects
  • if a ctor throws, obj was not fully constructed
    • obj dtor will not be called
    • but fully constructed subobjects will be destroyed
  • Always use RAII with ctors
    • Resource Acquisition Is Initialization
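
To illustrate the partial construction rules, here is a sketch with invented types (Counters, Resource, Widget). The constructor throws after one member has been built: the member is cleaned up, but the enclosing object's destructor never runs.

```cpp
#include <cassert>
#include <stdexcept>

struct Counters { int acquired = 0; int released = 0; };

// RAII wrapper: acquisition in the ctor, release in the dtor.
class Resource {
public:
    explicit Resource(Counters& c) : c_(c) { ++c_.acquired; }
    ~Resource() { ++c_.released; }
    Resource(const Resource&) = delete;
    Resource& operator=(const Resource&) = delete;
private:
    Counters& c_;
};

class Widget {
public:
    Widget(Counters& c, bool& dtorRan) : first_(c), dtorRan_(dtorRan) {
        // first_ is fully constructed here; Widget itself is not.
        throw std::runtime_error("constructor failed");
    }
    ~Widget() { dtorRan_ = true; }  // never reached for a failed construction
private:
    Resource first_;
    bool& dtorRan_;
};

// Returns true if ~Widget() ran (it should not).
inline bool tryConstruct(Counters& c) {
    bool dtorRan = false;
    try {
        Widget w(c, dtorRan);
    } catch (const std::runtime_error&) {
        assert(c.acquired == c.released);  // the member was already released
    }
    return dtorRan;
}
```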

Virtual Functions in Destructors

  • Virtual functions are not virtual inside dtors: a call resolves to the version in the class being destroyed, never to a derived override
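
A minimal demonstration with made-up types: each destructor calls the same virtual function, and each call lands on the version belonging to the class whose destructor is currently running.

```cpp
#include <cassert>
#include <string>

struct Base {
    std::string* out;
    explicit Base(std::string* o) : out(o) { assert(o != nullptr); }
    virtual ~Base() { *out += name(); }  // resolves to Base::name()
    virtual const char* name() const { return "Base"; }
};

struct Derived : Base {
    using Base::Base;
    ~Derived() override { *out += name(); *out += ","; }  // resolves to Derived::name()
    const char* name() const override { return "Derived"; }
};

// Destroys one Derived and returns what each destructor saw.
inline std::string namesSeenDuringDestruction() {
    std::string seen;
    { Derived d(&seen); }
    return seen;
}
```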

C++ Exception Handling

  • Destructors : Exceptions :: Spock : Kirk
  • Wrap any function that acquires a resource in a class where dtor releases the resource
  • Never allow an exception to exit a dtor
    • Best: don’t throw in dtor
    • OK: wrap throwing code in a try/catch
  • Good advice even if you don’t use C++EH
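
The "OK" option above might look like the following sketch. Connection and its close() are hypothetical; the point is only that the exception is caught and recorded inside the destructor rather than allowed to escape.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical resource whose release step can fail.
class Connection {
public:
    Connection(bool failOnClose, bool* closeFailed)
        : failOnClose_(failOnClose), closeFailed_(closeFailed) {
        assert(closeFailed != nullptr);
    }
    void close() {
        if (failOnClose_) throw std::runtime_error("close failed");
    }
    ~Connection() {
        try {
            close();
        } catch (...) {
            *closeFailed_ = true;  // record the failure, never rethrow
        }
    }
private:
    bool failOnClose_;
    bool* closeFailed_;
};

// Returns true if the failure was swallowed and recorded by the destructor.
inline bool destroyFailingConnection() {
    bool failed = false;
    { Connection c(true, &failed); }
    return failed;
}
```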

Multithreading

  • You are responsible for protecting objects and their contents
  • Sharing an object across threads
    • Use shared_ptr
    • or some other reference counting
    • or otherwise ensure only one thread can destroy
  • Protect shared memory (global counters, ref counts) in dtor

delete and Destructors

  • delete p is a two-step process: the destructor runs, then operator delete releases the memory
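
Written out by hand, the two steps are a destructor call followed by a call to operator delete. This sketch assumes a single object (not an array) of a type with no class-specific operator delete; Tracked is an invented type.

```cpp
#include <cassert>
#include <new>

struct Tracked {
    static int live;  // number of Tracked objects currently alive
    Tracked() { ++live; }
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Roughly what `delete p` does for a Tracked allocated with plain `new`.
inline void destroyLikeDelete(Tracked* p) {
    if (p == nullptr) return;  // delete on a null pointer is a no-op
    assert(Tracked::live > 0);
    p->~Tracked();             // step 1: run the destructor
    ::operator delete(p);      // step 2: release the storage
}
```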

Explicit Destructors

  • Destructors can be called directly
  • Avoid 99.9% of the time
  • Very powerful for custom memory scenarios
  • Examples
    • w / placement new
    • STL allocators
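
A sketch of the placement new case, with a hypothetical Particle type: the object is built in storage the caller owns, so it has to be destroyed with an explicit destructor call rather than delete.

```cpp
#include <cassert>
#include <new>

struct Particle {
    explicit Particle(int* liveCount) : live(liveCount) { ++*live; }
    ~Particle() { --*live; }
    int* live;
};

// Constructs a Particle inside caller-owned storage, then tears it down by hand.
// Returns the live count observed while the object existed.
inline int constructInPlace(void* storage, int* liveCount) {
    assert(storage != nullptr && liveCount != nullptr);
    Particle* p = new (storage) Particle(liveCount);  // placement new: no allocation
    int whileAlive = *p->live;
    p->~Particle();  // explicit destructor call: the storage itself is untouched
    return whileAlive;
}
```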

std::allocator

  • Allocators enable custom STL container memory
  • Two key destructive functions: destroy() (calls the destructor) and deallocate() (frees the memory)
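
As a sketch, the full allocator life cycle looks like this. It goes through std::allocator_traits (C++11 and later) rather than calling allocator members directly; roundTrip is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Allocate, construct, destroy, deallocate: the four steps a container performs.
inline std::size_t roundTrip() {
    std::allocator<std::string> alloc;
    using Traits = std::allocator_traits<std::allocator<std::string>>;

    std::string* p = Traits::allocate(alloc, 1);  // raw memory only
    assert(p != nullptr);
    Traits::construct(alloc, p, "hello");         // runs the constructor
    std::size_t length = p->size();
    Traits::destroy(alloc, p);                    // runs the destructor, keeps the memory
    Traits::deallocate(alloc, p, 1);              // releases the memory
    return length;
}
```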

shared_ptr

  • Templated, non-intrusive, deterministically reference-counted smart pointer

shared_ptr deleters

  • Deleter : a functor called on the stored raw pointer when ref count hits zero
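
A small sketch of a custom deleter using std::shared_ptr. The lambda stands in for whatever release call a real resource would need; here it only sets a flag so the timing is visible.

```cpp
#include <cassert>
#include <memory>

// Returns true if the deleter ran exactly when the last owner went away.
inline bool deleterRunsOnLastRelease() {
    bool released = false;
    int handle = 42;  // stand-in for a resource that must not be passed to delete
    {
        std::shared_ptr<int> a(&handle, [&released](int*) { released = true; });
        std::shared_ptr<int> b = a;  // ref count is now 2
        assert(a.use_count() == 2);
        a.reset();                   // ref count drops to 1: deleter must not run yet
        if (released) return false;
    }                                // b destroyed, ref count hits zero
    return released;
}
```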

Performance

  • Destructors are called a LOT
  • they are invisible in code
  • streamline common dtors
  • the best dtor is empty
  • inlining
  • profile

The Rendering Technology of KillZone 2 (Michal Valient)

How we made Killzone 2 run @ 30FPS

  • Deferred shading
  • Diet for render targets
  • Dirty lighting tricks
  • Rendering, memory and SPUs

Deferred shading

  • not forward rendering
  • Geometry pass – fill the GBuffer (all material info for lighting)
  • loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
  • Lighting pass – accumulate info (only light, no textures)

GBuffer

  • RGBA FP16 buffers proved to be too much
  • Moved to RGBA8
    • 4xRGBA8 + D24S8 – 18.4 MB
    • 2xMSAA (Quincunx) – 36.8 MB
  • Memory reused by later rendering stages
    • Low res pass, post processing, HUD
  • View space position computed from depth buffer
  • Normal.z = sqrt(1 – Normal.x² – Normal.y²)
    • No neg z, but does not cause problems
    • 2xFP16 compressed to RGBA8 on write
  • Motion vectors – screen space
  • Albedo – material diffuse color
  • Roughness – specular exponent in log range
  • Specular intensity – single channel only
  • Sun Shadow – pre-rendered sun shadows (offline light map)
    • Mixed with real-time sun shadows
  • Lighting accumulation buffer (LAB)
    • Geometry pass fills in indirect lighting terms
      • Stored in lightmaps and IBLs
      • Adds ambient color, scene reflections
    • Lighting pass adds contribution of each light
  • Glow – contains HDR luminance of LAB
    • Used to reconstruct HDR RGB for bloom
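
The memory figures and the normal reconstruction above can be checked with a little arithmetic. This sketch assumes a 1280x720 target (the resolution mentioned later in these notes), 4 bytes per pixel for every target, and MB meaning 10^6 bytes; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes used by `colourTargets` 32-bit colour targets plus one 32-bit
// depth/stencil target (D24S8), at the given sample count.
constexpr std::uint64_t gbufferBytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t colourTargets,
                                     std::uint64_t samplesPerPixel) {
    return width * height * samplesPerPixel * (colourTargets + 1) * 4;
}

// Rebuilds the third component of a unit-length view space normal from the
// two stored components. The sign is lost, which the notes say is acceptable.
inline float reconstructNormalZ(float x, float y) {
    assert(std::fabs(x) <= 1.0f && std::fabs(y) <= 1.0f);
    float zSquared = 1.0f - x * x - y * y;
    return std::sqrt(zSquared > 0.0f ? zSquared : 0.0f);  // clamp rounding error
}
```

With these assumptions, four RGBA8 targets plus depth come to 18,432,000 bytes, and doubling the sample count gives 36,864,000 bytes, matching the 18.4 MB and 36.8 MB figures.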

Lighting pass

  • Most expensive pass
    • 100+ dynamic lights per frame
    • 10+ shadow casting lights per frame
    • AA means more of everything
  • Optimization
    • Avoid hard work
    • Work less for MSAA
    • Precompute sun shadow offline
    • Approximate

Avoid hard work

  • Don’t run shaders
    • Use early z/stencil cull unit
    • Depth bounds test is the new cool
    • Enable conditional rendering
  • Optimized light shaders
    • For each combination of light features
  • Fade out shadows for small lights
  • Remove small objects from shadow map

Lighting pass and MSAA

  • MSAA facts
    • Each sample has to be lit
    • Samples of non-edge pixel are equal
  • KZ2 solution – in shader supersampling
    • Run at 1280x720 not 2560x720
    • Light two samples in one go

Shadow map filtering distribution

  • Motivation
    • Define filtering quality per pixel rather than per sample.
  • Split filter coordinates into disjoint sets
    • One set per pixel sample
  • MSAA is almost as fast as non-MSAA

Sunlight

  • Fullscreen directional light
  • We divide screen into depth slices
  • Each depth slice is lit separately
    • Different shadow properties
    • Used depth bounds test
  • Use sun shadow from GBuffer
    • Stencil mark pixels completely in shadow
      • Skip expensive sunlight shader
    • Also mixed with real-time shadows

Sunlight rendering – Fake MSAA

  • Used only in distance pixels
  • Cut down lighting cost
    • Run lighting equation on closest sample only
  • Is this wrong?
    • It’s a hack
    • Works correctly against background
    • The edges are still partially anti-aliased
    • Distant scenery is heavily post processed

Sunlight – shadow map rendering

  • Generate shadow map for each depth slice
  • Common approach
    • Align shadow map to view direction
    • Pros – max shadow map usage
    • Result – shadow map shimmering
  • Fix
    • Remove shadow map rotation
      • Align shadow maps to world instead of view
      • Remove sub-pixel movement
      • Cons – unused shadow map space

GPU driven memory allocation system

Push Buffer building

  • Multiple SPUs building PB in parallel
  • Additional SPUs generating data
    • Skinning, particles – VB
    • IBL interpolation – textures
  • Common solutions
    • Ring Buffering
      • Issue with out of order allocations
    • Double Buffering
      • Too much memory

KZ2 render memory allocator

  • Fixed mem pool
    • 22 MB block – split into 256 KB blocks
  • Each block has associated AllocationID
    • Specified by client during allocation
    • Only whole block can be allocated
  • Global FreeID identifies free blocks
    • Updated as RSX consumes ‘Free’ marker
  • Lockless, out of order, memory allocation
    • From PPU and/or SPU
    • Simple table walk (fast!)
  • Allows immediate memory reuse
    • We generate push-buffer just in time for RSX
    • Block can be reused right after RSX consumption
  • Can allocate memory for skinning early…
    • and still free at correct point in frame
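
One plausible reading of the scheme above, sketched in portable C++ with std::atomic. This is not the Killzone 2 implementation: the class and member names are invented, only the bookkeeping table is modelled (not the 22 MB pool itself), AllocationIDs are assumed to increase through the frame, and ID wrap-around is ignored.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

class BlockTable {
public:
    static constexpr std::size_t kBlockCount = 88;  // 22 MB / 256 KB
    static constexpr std::uint32_t kNoBlock = 0xFFFFFFFFu;

    BlockTable() {
        for (auto& id : ids_) id.store(0);  // ID 0 means "never used"
    }

    // Walks the table and claims the first block whose ID the consumer has
    // already passed. Returns the block index, or kNoBlock if none is free.
    std::uint32_t allocate(std::uint32_t allocationId) {
        const std::uint32_t freeId = freeId_.load(std::memory_order_acquire);
        assert(allocationId > freeId);  // a new ID must not look free already
        for (std::uint32_t i = 0; i < kBlockCount; ++i) {
            std::uint32_t current = ids_[i].load(std::memory_order_relaxed);
            // The compare-exchange makes the claim safe against other allocators.
            if (current <= freeId &&
                ids_[i].compare_exchange_strong(current, allocationId,
                                                std::memory_order_acq_rel)) {
                return i;
            }
        }
        return kNoBlock;
    }

    // Stand-in for the GPU consuming a 'Free' marker: every block tagged with
    // an ID at or below `id` becomes reusable at once.
    void markConsumed(std::uint32_t id) {
        freeId_.store(id, std::memory_order_release);
    }

private:
    std::array<std::atomic<std::uint32_t>, kBlockCount> ids_;
    std::atomic<std::uint32_t> freeId_{0};
};
```

Nothing is ever explicitly freed block by block: advancing the single FreeID value releases every block tagged at or before that point, which is what allows memory to be allocated early and still be released at the right point in the frame.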

Direct3D 11 Tessellation Deep Dive (presented by Matt Lee)

High fidelity characters seem a bit out of reach of real-time apps (games).

10 to 30K chars not out of reach for 360/PS3

Striving for Cinematic Quality Characters

think in terms of triangles currently

 

Catmull-Clark subdivision surfaces

  • Industry standard subdivision surface scheme

Modern implementations don’t require too smooth

 

Direct3D11

  • Realtime rendering of Catmull-Clark
  • 3 new pipeline stages
  • Hardware design removes bandwidth bottlenecks from current implementations
  • Better use of multi-core processors and improved shader management

(dynamic shader linking)

 

Direct3D11 Pipeline

  • Hull Shader
  • Tessellator
  • Domain Shader

Tessellation Data Flow

  • Hull shader – executed per patch
  • Tessellator – executed per patch (fixed-function), generates triangles
  • Domain shader – executed per tessellated vertex

Hull Shader

  • patch control points (input)
  • output to Domain Shader
  • two phases per patch(control points, patch constant)
  • patch constant output to tessellator (modifies behavior of tessellator)
  • control points go to domain shader

Tessellator

  • state from D3D API
  • input from Patch constant phase
  • generates tessellated triangles
  • out to later stages

Domain Shader

  • Hull Shader and Tessellator output
  • Smooth surface evaluated
  • one vertex
  • Control points are in GPU (saves bandwidth)

Where to use

  • LOD of terrain
  • Bezier patches from higher-order surfaces (Catmull-Clark)

Catmull-Clark

  • Baked into content early
  • Goal is real time
  • Disadvantage (have to build offline and huge memory issues)

Loop-Schaefer approximation (D3D11), others exist

Benefits

  • Content creation easier
  • Save memory
  • Easier LOD (doesn’t require sep meshes) so no use for MIP maps?

Pipeline

  • offline Load control mesh
  • offline compute adjacency for each quad
  • offline compute texture tangent space for each vertex
  • rt Morph & skin the quad mesh in the VS
  • rt convert quad mesh into patches in hull shader
  • rt Evaluate patches using domain shader
  • rt apply displacement map
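
The "evaluate patches" step is what the domain shader does for each tessellated vertex. A real domain shader is written in HLSL; the following is only a CPU-side C++ sketch of evaluating a bicubic Bezier patch at a (u, v) coordinate, with invented type and function names and a row-major 4x4 control point layout assumed.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Cubic Bernstein basis weights for parameter t in [0, 1].
inline std::array<float, 4> bernstein(float t) {
    assert(t >= 0.0f && t <= 1.0f);
    const float s = 1.0f - t;
    return {{ s * s * s, 3.0f * s * s * t, 3.0f * s * t * t, t * t * t }};
}

// Evaluates a bicubic Bezier patch: 16 control points, row-major, 4 per row.
inline Vec3 evaluatePatch(const std::array<Vec3, 16>& controlPoints, float u, float v) {
    const std::array<float, 4> bu = bernstein(u);
    const std::array<float, 4> bv = bernstein(v);
    Vec3 p{0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            const float w = bv[row] * bu[col];
            const Vec3& cp = controlPoints[row * 4 + col];
            p.x += w * cp.x;
            p.y += w * cp.y;
            p.z += w * cp.z;
        }
    }
    return p;
}
```

Displacement mapping would then offset the returned position along the surface normal.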

tangent patches (fix up surface normals), extraordinary vertices (valence < 4 or > 4)

Available in March 2009 DX SDK (today), next release in June 09

SubD11 sample

Optimization will be performed when hardware is finalized

Current shader design is not expected to perform well on hardware.

Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

  • no instruction reordering
  • no store forward hardware
  • smaller caches, slower memory
  • no l3 cache

 

  • LHS
  • L2 Miss
  • Expensive, non pipelined instructions
  • Branch mispredict penalty

Load Hit Store

  • Store to memory location, then load, flush the L2 cache
  • Casts, changing register set, aliasing
  • Passing by value, or by reference
  • On PC, instruction reordering and store-forwarding hardware hide this
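
A typical source-level pattern behind a load-hit-store is accumulating through a pointer or reference that might alias the input. The sketch below contrasts that with a local accumulator. Whether the first version actually stores and reloads on every iteration depends on the compiler and target, so treat this as an illustration of the pattern, not a measured result.

```cpp
#include <cassert>
#include <cstddef>

// Accumulates through a pointer. Because *total might alias values[], the
// compiler may have to write and re-read it every iteration.
inline void sumThroughPointer(const float* values, std::size_t count, float* total) {
    assert(values != nullptr && total != nullptr);
    *total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Accumulates in a local, which can live in a register for the whole loop.
inline float sumInRegister(const float* values, std::size_t count) {
    assert(values != nullptr);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}
```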

L2 Miss

  • Loading from location, checks cache
  • Cost ~610 cycles to load cache line
  • Hot cold split
  • Reduce in-memory data size
  • Use cache coherent structures
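
A hot/cold split might look like the following sketch. The Enemy types and fields are invented; the idea is that the per-frame loop touches only a small, densely packed struct, while rarely used data lives in a separate array.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Cold data: touched rarely (debugging, UI), kept out of the hot array.
struct EnemyCold {
    std::string name;
    std::string description;
};

// Hot data: touched every frame, kept small so many fit in one cache line.
struct EnemyHot {
    float x, y, z;
    std::uint32_t coldIndex;  // index into the cold array
};
static_assert(sizeof(EnemyHot) == 16, "eight hot records per 128-byte cache line");

// Per-frame style loop that reads only hot data.
inline float sumX(const std::vector<EnemyHot>& hot) {
    float total = 0.0f;
    for (const EnemyHot& e : hot) total += e.x;
    return total;
}

// Cold lookup, only when the extra data is really needed.
inline const std::string& nameOf(const EnemyHot& e, const std::vector<EnemyCold>& cold) {
    assert(e.coldIndex < cold.size());
    return cold[e.coldIndex].name;
}
```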

Expensive Instructions

  • non pipelined instructions
  • Stalls hardware threads

Branch Mispredict

  • Mispredicting branch
  • 23-24 cycle delay
  • Know how the compiler implements branches
  • Reduce total branch count for task
  • Refactor calculations to remove branches
  • Unroll
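
One way to refactor a calculation so the branch disappears is to use the comparison result as a value. This is a generic sketch, not code from the talk, and a compiler may already make the same transformation, so profile before and after.

```cpp
#include <cassert>
#include <cstddef>

// Branching version: one conditional jump per element.
inline int countAtLeastBranchy(const int* values, std::size_t count, int threshold) {
    assert(values != nullptr);
    int hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (values[i] >= threshold) ++hits;
    }
    return hits;
}

// Branch-free version: the comparison yields 0 or 1, which is simply added.
inline int countAtLeastBranchless(const int* values, std::size_t count, int threshold) {
    assert(values != nullptr);
    int hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        hits += static_cast<int>(values[i] >= threshold);
    }
    return hits;
}
```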

Profiling!!!

360 Tools

  • PIX CPU instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture

Other Platforms

  • SN Tuner, vTune

Think laterally

  • Inline functions
  • pass and return in registers (__declspec(passinreg))
  • __restrict (compiler is released from being ultra careful)
  • const

 

Compiler options

  • Inline
  • Prefer speed over size
  • Fast floating point over precise
  • 360 (/Ou removing div by zero, /Oc runs a second code scheduling pass)
  • Reduce parameter counts
  • Prefer 32, 64, 128 bit parameters
  • Isolate constants
  • Avoid virtual if feasible

Know your cache architecture

  • Cross core sharing policy (L2 shared, L1 single)
  • Prefetch mech (dcbt, dcbz128)
  • L2 1 MB, L1 32 KB
  • Cache line 128 bytes

Know your instruction set

  • 360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
  • PS3 (altivec)
  • PC (SSE2-4.1 and friends)

What went wrong

  • Correctness
  • Guessed at 1 perf issue
  • SIMD vs straight float
  • Memory access and L2 usage unchanged
  • Branch behavior exactly the same

Image Analysis

  • Gaussian Mixture Model
  • Profiling showed 86% of time in the pixel cost function

The PlayStation 3’s SPUs in the Real World (presented by Michiel van der Leeuw)

  • Things they did on the SPUs (post-mortem)
  • What worked and didn’t work
  • Practical advice
  • Food for thought

3 Years ~120 team size with 27 programmers

  • Cinematic
  • Dense
  • Realistic
  • Intense

 

  • 6 x 3.2 GHz processors
  • Local mem per SPU
  • Very fast DMA

 

  • Core Requirements
    • Animation
    • AI
    • Skinning
    • Physics
    • Compression/Decompression
    • etc

Graphics

  • Light probe sampling
  • ~2500 static light probes per level
  • 9x3 Spherical Harmonics in KD-tree
  • sample light, blend 4 closest light probes, rotate in view space
  • bake lights into level

Particle simulation

  • 250 particle systems per frame
  • 150 drawn
  • 3000 particles updated
  • 200 collision ray casts
  • System grown over time

Refactored

  • Vertex generation
  • Particle simulation inner loop
  • Initialization & deletion of particles
  • High-level management / glue

Not done on SPU

  • Updated global scene graph
  • Starting & stopping sounds

Image Post Processing

Effects done on SPU

  • Motion blur
  • Depth of field
  • Bloom

SPUs assist the RSX with post-processing

  • RSX prepares low-res image buffers
  • RSX triggers interrupt to start SPUs
  • SPUs perform image operations
  • RSX already starts next frame
  • Result in SPU processed by RSX early in next frame
  • Similar to PhyreEngine now

SPUs are compute-bound

  • Bandwidth no issue
  • Code can be optimized

Our trade-off: RSX vs SPU time

  • SPUs take longer
  • SPUs look better
  • RSX was the bottleneck

Bloom and Lens Reflection

  • 13% on one SPU
  • Depth dependent intensity response curve
  • 7x7 Gaussian blur
  • Upscaling results from different levels
  • Internal Lens Reflection
  • Result buffer

Waypoint cover maps –> depth map

IBL Sampling

SPU cost a lot of dev time

Code is future proof, scales to more cores, supports the items they require.

The future is memory-local and excessively parallel

SPUs are just one of these ‘new architectures’

Optimize for the concept

Keep code portable

Parallelization of code takes time

Treat CPU as cluster

Think in workloads / jobs

Build latency in algorithms

Don’t optimize too early

Lockless Programming in Games (Bruce Dawson, Microsoft)

Current Hardware

  • 360 – 6 hardware threads
  • PS3 – 9 hardware threads
  • Windows – Quad cores not uncommon
  • Point being multi-core is here to stay

Multithreading is mandatory if you want to harness the available power.  If not, you are wasting the advanced features of the hardware.

Multithreaded programming is easy if you don’t share data.  :)  Of course this is not usually an option.

The best way to share data between threads is by using locks.  This is important.  Lockless is not a one-size-fits-all approach.

Lockless programming typically involves a job queue, using an STL queue.  The problem is that STL queues are not thread safe.  So we have to make them safe. :)

Solution: use a critical section to guard the code.
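
A minimal sketch of that lock-based job queue. The talk refers to a Win32 critical section; std::mutex is used here as a portable stand-in, and LockedQueue is an invented name.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class LockedQueue {
public:
    // Adds a job; the lock is held only for the duration of the push.
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }

    // Removes the oldest job if there is one. Checking for emptiness and
    // popping happen under the same lock, so no other thread can interleave.
    bool tryPop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```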

Bad things

  • Acquiring and releasing locks takes time
  • Deadlocks
  • Contention – waiting, holding locks too long
  • Priority inversions – system threads on 360 do this (too often)

Use locks carefully or lockless

  • Safely share data without locks (no deadlocks or priority inversion)
  • Cons
    • Very limited, tricky, generally not portable

SList (singly linked list), InterlockedPushEntrySList

This is NOT a queue!  This is a stack!  Don’t use on 360!!!!!

One writer, one reader (singleton) (works on paper, not in real world)

Read data (cpu to L2), write (cpu to L2)

Writes can happen before getting put in L2 cache

Happens on reads too (second read could come from L1)

read and write can pass each other

 

On PowerPC, reads and writes can pass each other, but on x86 only a load can pass an earlier store

Reads not passing writes would basically disable L1, huge perf hit

publisher / subscriber model

ExportBarrier – no passing sign (stop sign) HANDLE BOTH reads and writes

 

Compilers are just as evil, rearrange code (single threaded)

Compiler/CPU reordering barriers needed

_ReadWriteBarrier();   x86 (compiler only)

__lwsync();  PowerPC  (both CPU and compiler)

Positioning is crucial (barrier between writes)

write-release semantics is the name

read-acquire semantics is the name

reader needs both read / write
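
The publish/subscribe pattern with write-release and read-acquire can be sketched with std::atomic, the C++0x facility mentioned later in these notes, instead of the platform barriers named above. Mailbox, publish and tryRead are invented names.

```cpp
#include <atomic>
#include <cassert>

struct Mailbox {
    int payload = 0;                  // ordinary data, written before the flag
    std::atomic<bool> ready{false};   // the flag that publishes the data
};

// Writer: fill in the data, then set the flag with release semantics so the
// payload write cannot be reordered after the flag write.
inline void publish(Mailbox& box, int value) {
    assert(!box.ready.load(std::memory_order_relaxed));  // publish only once
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader: check the flag with acquire semantics so the payload read cannot
// be reordered before the flag read.
inline bool tryRead(Mailbox& box, int& out) {
    if (!box.ready.load(std::memory_order_acquire)) return false;
    out = box.payload;
    return true;
}
```

The positioning matters in the same way as with explicit barriers: the release sits between the data write and the flag write, and the acquire sits between the flag read and the data read.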

 

Dekker’s / Peterson’s Algorithm

 

MemoryBarrier

  • x86 __asm xchg Barrier, eax
  • x64 __faststorefence()
  • PowerPC __sync()

 

what about volatile

standard volatile…..NO

doesn’t prevent CPU reordering, and all variables would need to be tagged volatile

VC++ volatile is better, but doesn’t prevent hardware reordering on 360

Acts as read-acquire / write-release on x86/x64 and Itanium

atomic<T> in C++0x

Double checked locking – singleton
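
A sketch of double-checked locking written with std::atomic and std::mutex rather than the platform barriers discussed in the talk. Config is a hypothetical singleton, and the instance is intentionally never deleted in this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Config {
public:
    static Config& instance() {
        // First check, without the lock. Acquire pairs with the release store below.
        Config* p = instance_.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(mutex_);
            // Second check, under the lock: another thread may have won the race.
            p = instance_.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Config();
                // Release: the constructed object is visible before the pointer is.
                instance_.store(p, std::memory_order_release);
            }
        }
        assert(p != nullptr);
        return *p;
    }

    int value = 0;

private:
    Config() = default;
    static std::atomic<Config*> instance_;
    static std::mutex mutex_;
};

std::atomic<Config*> Config::instance_{nullptr};
std::mutex Config::mutex_;
```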

 

InterlockedXxx

is not a memory barrier on 360

It’s a full barrier on x86/x64/Itanium

InterlockedXxx Acquire/Release are portable (preferred)

Uses

  • Reference counts
  • Setting a flag
  • Publish/Subscribe
  • SLists
  • XMCore on 360
  • Double checked locking

Export, import, full barriers

Prefer to use locks!!!!

use lockless when locks are too costly

http://msdn.microsoft.com/en-us/library/bb310595(VS.85).aspx

Keynote: Discovering new development opportunities (Satoru Iwata, Nintendo)

The day starts out with a packed room, all waiting to hear the keynote delivered by Satoru Iwata, President of Nintendo.  Given the unquestionable success of Nintendo products worldwide, this talk pulls back the curtain a bit on some of the ideas and methods that have led to this success.

Iwata started off by presenting the obligatory numbers slides showing how much success both the Wii and DS have had in recent history.  No one can take that away from them; they currently have several successful platforms.

Iwata then started to discuss arguably the most important developer at Nintendo, Mr. Miyamoto.  He explained that Miyamoto is one of the main reasons for the continued success, and described in some detail how Miyamoto’s development style has achieved it.

Mr. Miyamoto first starts with a core concept, as do most software projects.  One of the differences is where Miyamoto pulls the new ideas from.  He is fascinated with studying humans and their behaviors, specifically when they are doing something that makes them happy.  He draws on this to come up with the concept for the new piece of software that he is attempting to create.  For example, he got a dog for his family, and out of this was born Nintendogs.

Of course, having a good idea for a game concept and following it through to the release of a successful title are two very different things.  This brings up the next key point: Miyamoto’s software development style.  He typically forms a very small team (sometimes even just one developer) and they work on a prototype, or rather a series of prototypes.  At this stage, the graphics are very crude (boxes).  They work on this for however long it takes to perfect the core concept.  At this stage, not even the president of the company will ask how things are going, or when it will be ready for the next stage.  It should be noted that sometimes at this stage work will be done but then shelved.  This can happen for various reasons, but almost always at least some of it will be used at a later time.

If the prototyping has met Miyamoto’s satisfaction, only at this stage will others be brought in on the game.  This is where the polish (graphics and such) comes in, but the core gameplay is pretty much guaranteed at this point.  This avoids the situation where, after time has been spent on polish, a core gameplay element requires a rewrite.  That almost never happens with this style.

Nintendo also has some unique “playtest” elements in its projects.  They do not conduct formal playtests.  Instead, Miyamoto will “kidnap” a non-technical employee and have them play the game with no help.  He then checks to see how it works out.  If they are able to understand and play with no help, the dev team has done their job.  If not, it’s a failure and the game will be readjusted.

Next, Iwata unveiled the new Virtual Console with larger SD card support and the option to run from SD.  Some demos of future titles were also shown.  Rhythm Heaven was introduced as well, and he gave everyone in attendance a free copy before it goes on sale.