Thursday, March 26, 2009

The Rendering Technology of Killzone 2 (Michal Valient)

How we made Killzone 2 run @ 30FPS

  • Deferred shading
  • Diet for render targets
  • Dirty lighting tricks
  • Rendering, memory and SPUs

Deferred shading

  • not forward rendering
  • Geometry pass – fill the GBuffer (all material info for lighting)
  • loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
  • Lighting pass – accumulate info (only light, no textures)
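
The two passes above can be sketched in a few lines. This is only an illustration of the general deferred shading idea, not the Killzone 2 implementation: the names `GBufferTexel`, `DirectionalLight`, `geometry_pass` and `lighting_pass` are hypothetical, and the lighting model is a plain Lambert term chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class GBufferTexel:
    depth: float
    normal: tuple   # unit normal (x, y, z)
    albedo: tuple   # diffuse colour (r, g, b)


@dataclass
class DirectionalLight:
    direction: tuple  # unit vector from the surface toward the light
    color: tuple


def geometry_pass(surfaces):
    """Store material attributes once per pixel; no lighting happens here."""
    return [GBufferTexel(s["depth"], s["normal"], s["albedo"]) for s in surfaces]


def lighting_pass(gbuffer, lights):
    """Accumulate every light's contribution using only G-buffer data."""
    accum = [(0.0, 0.0, 0.0) for _ in gbuffer]
    for light in lights:
        for i, texel in enumerate(gbuffer):
            # Lambert term (an assumption made for this sketch).
            n_dot_l = max(0.0, sum(n * l for n, l in zip(texel.normal, light.direction)))
            accum[i] = tuple(
                a + alb * c * n_dot_l
                for a, alb, c in zip(accum[i], texel.albedo, light.color)
            )
    return accum
```

The point of the split is that the cost of the lighting loop depends on pixels and lights only, not on how many objects or textures produced those pixels.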

GBuffer

  • RGBA FP16 buffers proved to be too much
  • Moved to RGBA8
    • 4xRGBA8 + D24S8 – 18.4 MB
    • 2xMSAA (Quincunx) – 36.8 MB
  • Memory reused by later rendering stages
    • Low res pass, post processing, HUD
  • View space position computed from depth buffer
  • Normal.z = sqrt(1 – Normal.x² – Normal.y²)
    • No neg z, but does not cause problems
    • 2xFP16 compressed to RGBA8 on write
  • Motion vectors – screen space
  • Albedo – material diffuse color
  • Roughness – specular exponent in log range
  • Specular intensity – single channel only
  • Sun Shadow – pre-rendered sun shadows (offline light map)
    • Mixed with real-time sun shadows
  • Lighting accumulation buffer (LAB)
    • Geometry pass fills in indirect lighting terms
      • Stored in lightmaps and IBLs
      • Adds ambient color, scene reflections
    • Lighting pass adds contribution of each light
  • Glow – contains HDR luminance of LAB
    • Used to reconstruct HDR RGB for bloom
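
The memory figures and the normal reconstruction above can be checked with a short sketch. It assumes a 1280x720 target (the resolution quoted in the MSAA notes) and 4 bytes per pixel for both RGBA8 and D24S8; the function names are hypothetical, and the clamp in `reconstruct_normal_z` is an added safeguard against quantization error, not something stated in the talk.

```python
import math


def gbuffer_bytes(width=1280, height=720, color_targets=4, samples=1):
    """Size of the colour targets plus one depth/stencil target, in bytes."""
    bytes_per_pixel = 4            # RGBA8 and D24S8 are both 32 bits
    targets = color_targets + 1    # four colour targets plus depth/stencil
    return width * height * bytes_per_pixel * targets * samples


def reconstruct_normal_z(nx, ny):
    """Rebuild z of a unit normal from x and y; the result is never negative."""
    # Clamp so that rounding in the stored x and y cannot produce sqrt(<0).
    return math.sqrt(max(0.0, 1.0 - nx * nx - ny * ny))


print(gbuffer_bytes() / 1e6)            # 18.432
print(gbuffer_bytes(samples=2) / 1e6)   # 36.864
```

The quoted 18.4 MB and 36.8 MB therefore match decimal megabytes at 1280x720 with one and two samples per pixel.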

Lighting pass

  • Most expensive pass
    • 100+ dynamic lights per frame
    • 10+ shadow casting lights per frame
    • AA means more of everything
  • Optimization
    • Avoid hard work
    • Work less for MSAA
    • Precompute sun shadow offline
    • Approximate

Avoid hard work

  • Don’t run shaders
    • Use early z/stencil cull unit
    • Depth bounds test is the new cool
    • Enable conditional rendering
  • Optimized light shaders
    • For each combination of light features
  • Fade out shadows for small lights
  • Remove small objects from shadow map

Lighting pass and MSAA

  • MSAA facts
    • Each sample has to be lit
    • Samples of non-edge pixel are equal
  • KZ2 solution – in shader supersampling
    • Run at 1280x720 not 2560x720
    • Light two samples in one go

Shadow map filtering distribution

  • Motivation
    • Define filtering quality per pixel rather than per sample.
  • Split filter coordinates into disjoint sets
    • One set per pixel sample
  • MSAA is almost as fast as non-MSAA
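
One way to read the filter distribution idea: instead of every MSAA sample taking the full set of shadow map taps, the taps are dealt out into disjoint sets and each sample filters only its own set. After the MSAA resolve averages the samples, the pixel has effectively seen all taps. The sketch below assumes a round-robin split and a simple percentage-closer style average; `split_taps`, `pcf` and `resolved_shadow` are hypothetical names.

```python
def split_taps(taps, sample_count=2):
    """Deal filter taps round-robin into one disjoint set per MSAA sample."""
    return [taps[i::sample_count] for i in range(sample_count)]


def pcf(shadow_lookup, uv, taps):
    """Average the shadow test results over the given tap offsets."""
    lit = sum(shadow_lookup(uv[0] + dx, uv[1] + dy) for dx, dy in taps)
    return lit / len(taps)


def resolved_shadow(shadow_lookup, uv, taps, sample_count=2):
    """Filter each sample with its own tap set, then average like a resolve."""
    per_sample = [pcf(shadow_lookup, uv, s) for s in split_taps(taps, sample_count)]
    return sum(per_sample) / sample_count
```

When the sets have equal size, the resolved value equals the full filter evaluated once per pixel, while each sample pays for only a fraction of the taps.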

Sunlight

  • Fullscreen directional light
  • We divide screen into depth slices
  • Each depth slice is lit separately
    • Different shadow properties
    • Used depth bounds test
  • Use sun shadow from GBuffer
    • Stencil mark pixels completely in shadow
      • Skip expensive sunlight shader
    • Also mixed with real-time shadows
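
The slicing and the stencil skip can be sketched as a per-pixel selection. The slice distances below are made-up values, and treating a baked sun visibility of 0 as "completely in shadow" is an assumption about how the G-buffer channel is encoded.

```python
from bisect import bisect_left

# Hypothetical far distance of each depth slice, in view-space units.
SLICE_FAR_PLANES = [10.0, 40.0, 160.0]


def slice_index(view_depth, far_planes=SLICE_FAR_PLANES):
    """Index of the slice containing view_depth, or None beyond the last one."""
    i = bisect_left(far_planes, view_depth)
    return i if i < len(far_planes) else None


def pixels_for_slice(pixels, index, far_planes=SLICE_FAR_PLANES):
    """Pixels lit in this slice; pixels are (view_depth, baked_sun_visibility)."""
    return [
        i
        for i, (depth, visibility) in enumerate(pixels)
        # Fully shadowed pixels are masked out and never run the sun shader.
        if visibility > 0.0 and slice_index(depth, far_planes) == index
    ]
```

Each slice then gets its own sunlight pass with its own shadow settings, restricted to the pixels returned for that slice.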

Sunlight rendering – Fake MSAA

  • Used only in distant pixels
  • Cut down lighting cost
    • Run lighting equation on closest sample only
  • Is this wrong?
    • It's a hack
    • Works correctly against background
    • The edges are still partially anti-aliased
    • Distant scenery is heavily post processed
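
A small sketch of the hack, assuming "closest" means the sample nearest to the camera and that the single result is written to every sample of the pixel (the notes do not spell out either detail). `fake_msaa_sunlight` and the `(depth, payload)` sample layout are hypothetical.

```python
def fake_msaa_sunlight(samples, shade):
    """Light only the nearest sample and reuse the result for all samples.

    samples: list of (depth, payload) tuples belonging to one pixel.
    shade:   function evaluating the sunlight equation for one sample.
    """
    closest = min(samples, key=lambda s: s[0])
    result = shade(closest)             # lighting equation runs once per pixel
    return [result] * len(samples)
```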

Sunlight – shadow map rendering

  • Generate shadow map for each depth slice
  • Common approach
    • Align shadow map to view direction
    • Pros – max shadow map usage
    • Result – shadow map shimmering
  • Fix
    • Remove shadow map rotation
      • Align shadow maps to world instead of view
      • Remove sub-pixel movement
      • Cons – unused shadow map space
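
Removing sub-pixel movement is commonly done by snapping the shadow map origin to whole texels in world-aligned light space, so that the same world point always lands on the same texel as the camera moves. The sketch below shows that common technique under assumed parameters; the talk notes do not give the exact method, and `snap_to_texel` is a hypothetical helper.

```python
import math


def snap_to_texel(origin_x, origin_y, world_extent, resolution):
    """Snap a shadow map origin to multiples of one texel's world size.

    world_extent: width of the area covered by the shadow map, in world units.
    resolution:   shadow map width in texels.
    """
    texel = world_extent / resolution
    return (
        math.floor(origin_x / texel) * texel,
        math.floor(origin_y / texel) * texel,
    )
```

Small camera movements then leave the snapped origin unchanged until a full texel boundary is crossed, which avoids the shimmering caused by resampling shadow edges every frame.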

GPU driven memory allocation system

Push Buffer building

  • Multiple SPUs building PB in parallel
  • Additional SPUs generating data
    • Skinning, particles – VB
    • IBL interpolation – textures
  • Common solutions
    • Ring Buffering
      • Issue with out of order allocations
    • Double Buffering
      • Too much memory

KZ2 render memory allocator

  • Fixed mem pool
    • 22MB block – split into 256k blocks
  • Each block has associated AllocationID
    • Specified by client during allocation
    • Only whole block can be allocated
  • Global FreeID identifies free blocks
    • Updated as RSX consumes ‘Free’ marker
  • Lockless, out of order, memory allocation
    • From PPU and/or SPU
    • Simple table walk (fast!)
  • Allows immediate memory reuse
    • We generate push-buffer just in time for RSX
    • Block can be reused right after RSX consumption
  • Can allocate memory for skinning early…
    • and still free at correct point in frame
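
One possible reading of the scheme above, as a single-threaded sketch: every block stores the AllocationID it was claimed with, and a block counts as free once its ID is no greater than the global FreeID. That comparison rule, the class name and the method names are all assumptions made for illustration; a real lockless version would claim blocks with an atomic compare-and-swap rather than a plain write.

```python
BLOCK_SIZE = 256 * 1024
POOL_SIZE = 22 * 1024 * 1024


class RenderBlockAllocator:
    def __init__(self, pool_size=POOL_SIZE, block_size=BLOCK_SIZE):
        self.block_size = block_size
        # One AllocationID per block; 0 means the block was never allocated.
        self.block_ids = [0] * (pool_size // block_size)
        self.free_id = 0  # every block with ID <= free_id is free

    def allocate(self, allocation_id):
        """Claim the first free block; return its byte offset, or None if full."""
        if allocation_id <= self.free_id:
            raise ValueError("allocation_id must be newer than the current FreeID")
        for index, block_id in enumerate(self.block_ids):  # simple table walk
            if block_id <= self.free_id:
                self.block_ids[index] = allocation_id
                return index * self.block_size
        return None

    def consume_free_marker(self, marker_id):
        """Called when the GPU passes a 'Free' marker in the push buffer."""
        self.free_id = max(self.free_id, marker_id)
```

Because the client picks the AllocationID, memory for skinning can be claimed early with the ID of a later marker and is released only when the GPU reaches that marker. With the default sizes the table has 88 entries.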
