Thursday, May 13, 2010

Silverlight on Windows Phone 7 – Performance Tuning

Recently, I have been reviewing some of the content from MIX10, specifically pertaining to the talks on the new Windows Phone.  I came upon the talk from Seema Ramchandani (http://live.visitmix.com/MIX10/Sessions/CL60).  This was full of good tips on performance tools and tuning for Silverlight on the phone.  I wanted to share the items that stood out to me.

First, the idea of having a UI thread (which uses the phone CPU) and a render thread (which drives the GPU) is key to making a smooth, rich interface.  The UI thread should be reserved primarily for input, with animations and other calculations done on the render thread.

UI Virtualization

For instance, the ListBox shows 6 rows by default (with the default styles).  What is hidden is that the 5 rows above and the 5 rows below are also in the visual tree.  These are not recreated as the UI is scrolled but simply reused.

The Loop of all Loops (render loop)

A timer triggers the render loop every 33 milliseconds (roughly 30 frames per second) and attempts to draw.  First we check for property changes (i.e. movement).  Then the visual tree is walked twice: the first pass determines “how big” everything is, and the second pass arranges the items to be rendered (and clips them).  Next we queue up the rendering changes.  Only one back buffer is rasterized per frame.  Finally, the buffer is rendered.
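
To make the two tree passes concrete, here is a minimal C++ sketch of a measure pass followed by an arrange pass over a vertical stack of elements. This is only an illustration of the idea from the talk, not how Silverlight is implemented: the `Element`, `Size` and `Rect` types are made up for the example, and clipping is left out.

```cpp
#include <algorithm>
#include <vector>

struct Size { float width, height; };
struct Rect { float x, y, width, height; };

// Hypothetical element: a leaf with a fixed size, or a vertical stack of children.
struct Element {
    Size fixed{};                    // used only when there are no children
    std::vector<Element> children;
    Size desired{};                  // filled in by measure()
    Rect slot{};                     // filled in by arrange()

    // Pass 1: ask every element "how big" it wants to be.
    Size measure() {
        if (children.empty()) { desired = fixed; return desired; }
        desired = Size{0.0f, 0.0f};
        for (Element& child : children) {
            Size s = child.measure();
            desired.width = std::max(desired.width, s.width);
            desired.height += s.height;
        }
        return desired;
    }

    // Pass 2: hand every element its final rectangle inside the parent.
    void arrange(const Rect& finalRect) {
        slot = finalRect;
        float y = finalRect.y;
        for (Element& child : children) {
            child.arrange(Rect{finalRect.x, y, finalRect.width, child.desired.height});
            y += child.desired.height;
        }
    }
};
```

Calling `root.measure()` and then `root.arrange(...)` with the measured size once per frame mirrors the order described in the talk.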

Tools

  • Enable frame rate counter (in main)
  • Enable redraw regions (to see what is being redrawn in different colors)

We should only be drawing the changes, and the redraw regions flag will show you what this is.  Lots of colors is bad :).

Threads

  • Render thread
  • UI thread
  • Child threads
      • Rasterization
      • Media Decoding

 

  • Less is more – keep the app lean for best performance (goes without saying)
  • Limit activity on UI thread
  • Leverage render thread
  • Leverage GPU
  • Debug, Debug, Debug

Active input on the device can take 15-25% of the device CPU!  Thus, we need the render thread to keep the application responding in real time.

A quick way to check whether the GPU is being used: enable the frame rate counter.  If it is rendered when you run the application, then the GPU is being used.

Use CacheMode="BitmapCache" to force bitmap caching for GPU optimization.  Certain items will cache automatically (e.g. double animations).

Order is important in the XAML to keep the cached bitmaps close together (fewer textures), which will increase GPU performance.

Media and effects in the same frame == bad!

Max texture size on desktop and phone is 2048 in any direction.

60 fps in the CTP, but will be limited to 30 fps max on release build (to keep battery life as high as possible).

Thursday, January 21, 2010

Global Game Jam

Had a friend mention the Global Game Jam to me today.  I was still able to register for the New York City location.  The event takes place between January 29th and 31st.  The NYC location currently has 66 jammers registered.  I will post updates as this takes place.

Global Game Jam Site

Saturday, May 9, 2009

Project: GPU replacement

Recently, my otherwise perfect XPS laptop experienced a GPU failure.  It was time for me to upgrade my laptop anyway, so I went ahead and got a 1730 to replace this 1710.  I then contacted a Dell reseller in CA for a replacement GPU (these things are $680 from Dell) and was able to secure a new one for $400.  This is the Nvidia 7950 GTX, the largest GPU available for the XPS M1710.  I received the new GPU a few days ago and installed it today.  It required pretty much stripping the laptop, but all is working fine now.  Pics can be found here.  BTW, the 1730 I purchased has dual 8800 GTX (for a grand total of 1GB video memory) with SLI.  It runs well. :)

Thursday, March 26, 2009

The Beauty of Destruction (Pete Isensee, Microsoft)

C++ Destructor Definition

  • One
  • Special
  • Deterministic –> called at well defined times
  • Automatic –> object out of scope or delete
  • Symmetric –> constructor fits
  • Member –> part of a class
  • Function
  • With
    • A special name (~)
    • No parameters
    • No return type
  • Designed to
    • Give last rites
    • Before object death

C# finalizers are different: they are called by the GC and are non-deterministic. The same is true in Java.

When destructors are called

  • Global or static object: called when the program terminates
  • Arrays: elements destroyed in reverse order
  • STL container: unspecified order
  • delete operator
  • out of scope
  • temp objects
  • exception thrown (stack unwinding)
  • explicitly
  • exit()
  • abort (does not call destructor)

Order of destruction

  • Rule of thumb: Reverse order of construction
  • Specifically
    • Destructor body
    • Data members in reverse order of declaration
    • Direct non-virtual base classes in reverse order
    • Virtual base classes in reverse order
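
A small sketch of the ordering rules above. The `Tracer`, `Base` and `Derived` types and the `events()` log are names I made up for the illustration; there are no virtual base classes in this example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Log of destruction events so the order can be inspected afterwards.
std::vector<std::string>& events() {
    static std::vector<std::string> entries;
    return entries;
}

struct Tracer {
    std::string name;
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { events().push_back("~" + name); }
};

struct Base {
    Tracer baseMember{"baseMember"};
    ~Base() { events().push_back("~Base body"); }
};

struct Derived : Base {
    Tracer first{"first"};      // declared first, destroyed last among members
    Tracer second{"second"};
    ~Derived() { events().push_back("~Derived body"); }
};

// Destroys one Derived and returns the recorded order:
// destructor body, members in reverse declaration order, then the base class.
std::vector<std::string> destroyOne() {
    events().clear();
    { Derived d; }   // d goes out of scope here
    return events();
}
```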

Implicit Destructors

  • not specified by programmer
  • inline by default
  • public
  • recommended for struct-like POD-only objects
  • for everything else, avoid implicit destructors
    • better debugging
    • improved perf analysis

Trivial Destructors

  • Implicit
  • Not virtual
  • All direct base classes have trivial dtors
  • All non-static members have trivial dtors
  • Destructors that never do anything

Virtual Destructors

  • Guarantee that derived classes get cleaned up
  • Rule of thumb: if class has virtual functions, dtor should be virtual
    • if delete on Base* could ever point to a Derived*
  • Perf: Obj with any virtual funcs includes a vtable ptr
  • Idiom exceptions: mixin classes
  • A pure virtual dtor signals an abstract class (virtual ~T() = 0; a definition must still be provided)
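
A minimal sketch of both points: deleting a derived object through a base pointer, and a pure virtual destructor that still has a body. `Shape`, `Square`, `Interface` and the counter are hypothetical names for the example.

```cpp
#include <memory>

int g_squareCleanups = 0;

struct Shape {
    virtual ~Shape() {}                  // virtual, so delete through Shape* is safe
    virtual double area() const = 0;
};

struct Square : Shape {
    explicit Square(double s) : side(s) {}
    ~Square() override { ++g_squareCleanups; }
    double area() const override { return side * side; }
    double side;
};

// Pure virtual destructor: marks the class abstract, but it still needs a
// definition because every derived destructor calls it.
struct Interface {
    virtual ~Interface() = 0;
};
inline Interface::~Interface() {}

double areaThenDestroy() {
    std::unique_ptr<Shape> shape(new Square(3.0));
    return shape->area();
}   // unique_ptr deletes through Shape*: ~Square runs, then ~Shape
```

If `~Shape` were not virtual, deleting the `Square` through a `Shape*` would be undefined behavior and `~Square` would typically be skipped.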

Partial Construction & Destruction

  • Dtors are only called for fully constructed objects
  • if a ctor throws, obj was not fully constructed
    • obj dtor will not be called
    • but fully constructed subobjects will be destroyed
  • Always use RAII with ctors
    • Resource Acquisition Is Initialization
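
Here is a small sketch of what happens when a constructor throws. `Resource`, `Widget` and the `events()` log are hypothetical names used only for the illustration.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::string>& events() {
    static std::vector<std::string> entries;
    return entries;
}

struct Resource {
    Resource()  { events().push_back("Resource acquired"); }
    ~Resource() { events().push_back("Resource released"); }
};

struct Widget {
    Resource res;    // fully constructed before the Widget ctor body runs
    Widget()  { throw std::runtime_error("ctor failed"); }
    ~Widget() { events().push_back("~Widget"); }   // never runs in this example
};

// Tries to build a Widget and returns what actually happened.
std::vector<std::string> tryConstruct() {
    events().clear();
    try {
        Widget w;
    } catch (const std::runtime_error&) {
        events().push_back("caught");
    }
    return events();
}
```

The member `res` is released during unwinding even though `~Widget` is never called, which is why owning each resource through its own RAII member keeps a throwing constructor leak-free.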

Virtual Functions in Destructors

  • Virtual functions are not virtual inside dtors
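
A short sketch of what that means in practice: inside a destructor, a virtual call resolves to the class whose destructor is currently running, because the derived part is already gone. The type names are made up for the example.

```cpp
#include <string>

std::string g_seen;

struct Base {
    virtual ~Base() { g_seen += who(); }           // resolves to Base::who here
    virtual std::string who() const { return "Base"; }
};

struct Derived : Base {
    ~Derived() override { g_seen += who(); g_seen += ","; }   // Derived::who
    std::string who() const override { return "Derived"; }
};

// Destroys one Derived and returns the names seen by each destructor.
std::string destroyDerived() {
    g_seen.clear();
    { Derived d; }
    return g_seen;
}
```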

C++ Exception Handling

  • Destructors : Exceptions :: Spock : Kirk
  • Wrap any function that acquires a resource in a class where dtor releases the resource
  • Never allow an exception to exit a dtor
    • Best: don’t throw in dtor
    • OK: wrap throwing code in a try/catch
  • Good advice even if you don’t use C++EH
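
A minimal sketch of the advice above, assuming a made-up resource API (`acquireHandle`, `releaseHandle`, `riskyCleanup` are hypothetical stand-ins): the resource lives in a class whose dtor releases it, and the dtor wraps the throwing call so nothing escapes.

```cpp
#include <stdexcept>

int g_openHandles = 0;

// Hypothetical resource API, only for this illustration.
int  acquireHandle()    { return ++g_openHandles; }
void releaseHandle(int) { --g_openHandles; }
void riskyCleanup()     { throw std::runtime_error("cleanup failed"); }

class HandleGuard {
public:
    HandleGuard() : handle_(acquireHandle()) {}
    ~HandleGuard() {
        try {
            riskyCleanup();              // may throw
        } catch (...) {
            // swallow (or log); the exception must not leave the dtor
        }
        releaseHandle(handle_);
    }
    HandleGuard(const HandleGuard&) = delete;
    HandleGuard& operator=(const HandleGuard&) = delete;
private:
    int handle_;
};

// Returns true if the handle was already released when the exception is caught.
bool workThatThrows() {
    try {
        HandleGuard guard;
        throw std::runtime_error("work failed");   // unwinding runs ~HandleGuard
    } catch (const std::runtime_error&) {
        return g_openHandles == 0;
    }
    return false;
}
```

If `riskyCleanup` were allowed to throw out of `~HandleGuard` while the first exception was unwinding, the program would call `std::terminate`.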

Multithreading

  • You are responsible for protecting objects and their contents
  • Sharing an object across threads
    • Use shared_ptr
    • or some other reference counting
    • or otherwise ensure only one thread can destroy
  • Protect shared memory (global counters, ref counts) in dtor

delete and Destructors

  • delete p is a two-step process
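
My notes do not list the two steps, but in standard C++ they are: first the destructor runs, then `operator delete` frees the memory. A small sketch that makes both visible with a class-level `operator new`/`operator delete` (the `Tracked` type and `events()` log are made up for the example):

```cpp
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

std::vector<std::string>& events() {
    static std::vector<std::string> entries;
    return entries;
}

struct Tracked {
    ~Tracked() { events().push_back("destructor"); }

    static void* operator new(std::size_t size) {
        events().push_back("operator new");
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }
    static void operator delete(void* p) noexcept {
        events().push_back("operator delete");
        std::free(p);
    }
};

std::vector<std::string> newThenDelete() {
    events().clear();
    Tracked* p = new Tracked;
    delete p;   // step 1: p->~Tracked(); step 2: Tracked::operator delete(p)
    return events();
}
```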

Explicit Destructors

  • Destructors can be called directly
  • Avoid 99.9% of the time
  • Very powerful for custom memory scenarios
  • Examples
    • w / placement new
    • STL allocators
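
A minimal sketch of the placement new case: the object is constructed into memory we already own, so it has to be destroyed with an explicit destructor call rather than `delete`. `Particle` and the counter are hypothetical names.

```cpp
#include <new>
#include <string>
#include <utility>

int g_alive = 0;

struct Particle {
    std::string name;
    explicit Particle(std::string n) : name(std::move(n)) { ++g_alive; }
    ~Particle() { --g_alive; }
};

// Returns how many particles were alive while the object existed.
int placementRoundTrip() {
    alignas(Particle) unsigned char storage[sizeof(Particle)];  // raw memory we own
    Particle* p = new (storage) Particle("spark");   // construct in place
    int aliveWhileConstructed = g_alive;
    p->~Particle();   // explicit destructor call; delete would try to free the stack buffer
    return aliveWhileConstructed;
}
```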

std::allocator

  • Allocators enable custom STL container memory
  • Two key destructive functions
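
My notes do not name the two functions; I am assuming they are `destroy` (runs the destructor) and `deallocate` (returns the memory). A sketch using `std::allocator_traits`, which is how current C++ reaches them:

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Allocates, constructs, destroys and deallocates one std::string by hand.
std::size_t allocatorRoundTrip() {
    using Alloc  = std::allocator<std::string>;
    using Traits = std::allocator_traits<Alloc>;

    Alloc alloc;
    std::string* p = Traits::allocate(alloc, 1);   // raw memory, no object yet
    Traits::construct(alloc, p, "hello");          // object now exists
    std::size_t length = p->size();
    Traits::destroy(alloc, p);                     // runs the destructor only
    Traits::deallocate(alloc, p, 1);               // returns the memory only
    return length;
}
```

This is the same split as the two steps of `delete`, which is what lets containers keep spare capacity without constructed objects in it.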

shared_ptr

  • Templated, non-intrusive, deterministically reference-counted smart pointer

shared_ptr deleters

  • Deleter : a functor called on the stored raw pointer when ref count hits zero
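
A small sketch of a custom deleter. `Texture`, `acquireTexture` and `releaseTexture` are hypothetical stand-ins for whatever pool or API actually owns the object.

```cpp
#include <memory>

int g_released = 0;

struct Texture { int id; };

// Hypothetical acquire/release pair, only for this illustration.
Texture* acquireTexture(int id) { return new Texture{id}; }
void releaseTexture(Texture* t) { ++g_released; delete t; }

// Returns the texture id, or -1 if the ref counting did not behave as expected.
int sharedWithDeleter() {
    std::shared_ptr<Texture> a(acquireTexture(7), &releaseTexture);
    {
        std::shared_ptr<Texture> b = a;        // ref count is now 2
        if (a.use_count() != 2) return -1;
    }                                          // back to 1, deleter not called yet
    if (g_released != 0) return -1;
    return a->id;
}   // a is destroyed here: ref count hits zero and releaseTexture runs
```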

Performance

  • Destructors are called a LOT
  • they are invisible in code
  • streamline common dtors
  • the best dtor is empty
  • inlining
  • profile

The Rendering Technology of KillZone 2 (Michal Valient)

How we made Killzone 2 run @ 30FPS

  • Deferred shading
  • Diet for render targets
  • Dirty lighting tricks
  • Rendering, memory and SPUs

Deferred shading

  • not forward rendering
  • Geometry pass – fill the GBuffer (all material info for lighting)
  • loading depth map, normal / bump map, albedo (diffuse color and texture), shininess (reflective materials)
  • Lighting pass – accumulate info (only light, no textures)

GBuffer

  • RGBA FP16 buffers proved to be too much
  • Moved to RGBA8
    • 4xRGBA8 + D24S8 – 18.4 MB
    • 2xMSAA (Quincunx) – 36.8 MB
  • Memory reused by later rendering stages
    • Low res pass, post processing, HUD
  • View space position computed from depth buffer
  • Normal.z = sqrt(1 - Normal.x^2 - Normal.y^2)
    • No neg z, but does not cause problems
    • 2xFP16 compressed to RGBA8 on write
  • Motion vectors – screen space
  • Albedo – material diffuse color
  • Roughness – specular exponent in log range
  • Specular intensity – single channel only
  • Sun Shadow – pre-rendered sun shadows (offline light map)
    • Mixed with real-time sun shadows
  • Lighting accumulation buffer (LAB)
    • Geometry pass fills in indirect lighting terms
      • Stored in lightmaps and IBLs
      • Adds ambient color, scene reflections
    • Lighting pass adds contribution of each light
  • Glow – contains HDR luminance of LAB
    • Used to reconstruct HDR RGB for bloom
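
The memory figures and the normal reconstruction above can be checked with a little arithmetic. This sketch assumes a 1280x720 target and 4 bytes per pixel for each of the five buffers (four RGBA8 plus D24S8); the function names are my own.

```cpp
#include <cmath>
#include <cstddef>

// Bytes used by `targets` 32-bit render targets with `samples` samples per pixel.
constexpr std::size_t gbufferBytes(std::size_t width, std::size_t height,
                                   std::size_t targets, std::size_t samples) {
    return width * height * targets * samples * 4;   // RGBA8 and D24S8 are 4 bytes each
}

// Rebuild the view-space normal's z from x and y, assuming a unit-length
// normal with z >= 0 (the notes say negative z is not stored).
inline float reconstructNormalZ(float x, float y) {
    float zSquared = 1.0f - x * x - y * y;
    return zSquared > 0.0f ? std::sqrt(zSquared) : 0.0f;   // clamp guards rounding error
}
```

Under those assumptions, `gbufferBytes(1280, 720, 5, 1)` is 18,432,000 bytes (about 18.4 MB) and doubling the samples for 2xMSAA gives 36,864,000 bytes (about 36.8 MB), which matches the numbers from the talk.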

Lighting pass

  • Most expensive pass
    • 100+ dynamic lights per frame
    • 10+ shadow casting lights per frame
    • AA means more of everything
  • Optimization
    • Avoid hard work
    • Work less for MSAA
    • Precompute sun shadow offline
    • Approximate

Avoid hard work

  • Don’t run shaders
    • Use early z/stencil cull unit
    • Depth bounds test is the new cool
    • Enable conditional rendering
  • Optimized light shaders
    • For each combination of light features
  • Fade out shadows for small lights
  • Remove small objects from shadow map

Lighting pass and MSAA

  • MSAA facts
    • Each sample has to be lit
    • Samples of non-edge pixel are equal
  • KZ2 solution – in shader supersampling
    • Run at 1280x720 not 2560x720
    • Light two samples in one go

Shadow map filtering distribution

  • Motivation
    • Define filtering quality per pixel rather than per sample.
  • Split filter coordinates into disjoint sets
    • One set per pixel sample
  • MSAA is almost as fast as non-MSAA

Sunlight

  • Fullscreen directional light
  • We divide screen into depth slices
  • Each depth slice is lit separately
    • Different shadow properties
    • Used depth bounds test
  • Use sun shadow from GBuffer
    • Stencil mark pixels completely in shadow
      • Skip expensive sunlight shader
    • Also mixed with real-time shadows

Sunlight rendering – Fake MSAA

  • Used only on distant pixels
  • Cut down lighting cost
    • Run lighting equation on closest sample only
  • Is this wrong?
    • It's a hack
    • Works correctly against background
    • The edges are still partially anti-aliased
    • Distant scenery is heavily post processed

Sunlight – shadow map rendering

  • Generate shadow map for each depth slice
  • Common approach
    • Align shadow map to view direction
    • Pros – max shadow map usage
    • Result – shadow map shimmering
  • Fix
    • Remove shadow map rotation
      • Align shadow maps to world instead of view
      • Remove sub-pixel movement
      • Cons – unused shadow map space

GPU driven memory allocation system

Push Buffer building

  • Multiple SPUs building PB in parallel
  • Additional SPUs generating data
    • Skinning, particles – VB
    • IBL interpolation – textures
  • Common solutions
    • Ring Buffering
      • Issue with out of order allocations
    • Double Buffering
      • Too much memory

KZ2 render memory allocator

  • Fixed mem pool
    • 22MB block – split into 256k blocks
  • Each block has associated AllocationID
    • Specified by client during allocation
    • Only whole block can be allocated
  • Global FreeID identifies free blocks
    • Updated as RSX consumes ‘Free’ marker
  • Lockless, out of order, memory allocation
    • From PPU and/or SPU
    • Simple table walk (fast!)
  • Allows immediate memory reuse
    • We generate push-buffer just in time for RSX
    • Block can be reused right after RSX consumption
  • Can allocate memory for skinning early…
    • and still free at correct point in frame
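
To help me picture the allocator, here is a single-threaded sketch of the block table idea. It is my own interpretation, not the Killzone 2 code: I am assuming allocation IDs are positive and increase in the order the GPU consumes them, so a block counts as free once its ID is less than or equal to the global FreeID. The real system is lockless and shared between PPU and SPUs, which this sketch does not attempt.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

class BlockPool {
public:
    static constexpr std::size_t kBlockSize  = 256 * 1024;
    static constexpr std::size_t kBlockCount = 22 * 1024 * 1024 / kBlockSize;   // 88 blocks

    // Returns the index of a block tagged with allocationId, or -1 if none is free.
    // Assumes allocationId is greater than the current FreeID.
    int allocate(std::uint32_t allocationId) {
        for (std::size_t i = 0; i < kBlockCount; ++i) {   // simple table walk
            if (ids_[i] <= freeId_) {                     // block is free
                ids_[i] = allocationId;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Called once the GPU has consumed the 'Free' marker for allocationId.
    void onFreeMarkerConsumed(std::uint32_t allocationId) { freeId_ = allocationId; }

private:
    std::array<std::uint32_t, kBlockCount> ids_{};   // 0 means never allocated
    std::uint32_t freeId_ = 0;
};
```

Nothing is ever explicitly freed per block: moving FreeID forward releases every block tagged with an older ID in one step, which is what makes immediate reuse cheap.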

Direct3D 11 Tessellation Deep Dive (presented by Matt Lee)

High fidelity characters seem a bit out of reach of real-time apps (games).

10 to 30K chars not out of reach for 360/PS3

Striving for Cinematic Quality Characters

think in terms of triangles currently

 

Catmull-Clark subdivision surfaces

  • Industry standard subdivision surface scheme

Modern implementations don’t require too smooth

 

Direct3D11

  • Realtime rendering of Catmull-Clark
  • 3 new pipeline stages
  • Hardware design removes bandwidth bottlenecks from current implementations
  • Better use of multi-core processors and improved shader management

(dynamic shader linking)

 

Direct3D11 Pipeline

  • Hull Shader
  • Tessellator
  • Domain Shader

Tessellation Data Flow

  • Hull shader – executed per patch
  • Tessellator – executed per patch (fixed-function), generates triangles
  • Domain shader – executed per tessellated vertex

Hull Shader

  • patch control points (input)
  • output to Domain Shader
  • two phases per patch (control points, patch constant)
  • patch constant output to tessellator (modifies behavior of tessellator)
  • control points go to domain shader

Tessellator

  • state from D3D API
  • input from Patch constant phase
  • generates tessellated triangles
  • out to later stages

Domain Shader

  • Hull Shader and Tessellator output
  • Smooth surface evaluated
  • one vertex
  • Control points are in GPU (saves bandwidth)
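
As a CPU-side illustration of what the domain shader does for each tessellated vertex, here is a sketch that evaluates a bicubic Bezier patch at a (u, v) location. A real domain shader would be written in HLSL; the `Vec3` type, the row-major 4x4 control point layout and the function names are assumptions for this example.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

inline Vec3 operator*(float s, const Vec3& v) { return Vec3{s * v.x, s * v.y, s * v.z}; }
inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Cubic Bernstein basis weights at parameter t in [0, 1].
inline std::array<float, 4> bernstein(float t) {
    float s = 1.0f - t;
    std::array<float, 4> w = {s * s * s, 3.0f * s * s * t, 3.0f * s * t * t, t * t * t};
    return w;
}

// Evaluate a bicubic Bezier patch (4x4 control points, row-major) at (u, v).
inline Vec3 evaluatePatch(const std::array<Vec3, 16>& controlPoints, float u, float v) {
    std::array<float, 4> bu = bernstein(u);
    std::array<float, 4> bv = bernstein(v);
    Vec3 p{0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            p = p + (bv[row] * bu[col]) * controlPoints[row * 4 + col];
    return p;
}
```

The tessellator supplies the (u, v) pairs and the hull shader supplies the control points, so only the 16 control points per patch need to be stored rather than the dense mesh.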

Where to use

  • LOD of terrain
  • Bezier patches from higher-order surfaces (Catmull-Clark)

Catmull-Clark

  • Baked into content early
  • Goal is real time
  • Disadvantage (have to build offline and huge memory issues)

Loop-Schaefer approximation (D3D11), others exist

Benefits

  • Content creation easier
  • Save memory
  • Easier LOD (doesn’t require sep meshes) so no use for MIP maps?

Pipeline

  • offline Load control mesh
  • offline compute adjacency for each quad
  • offline compute texture tangent space for each vertex
  • rt Morph & skin the quad mesh in the VS
  • rt convert quad mesh into patches in hull shader
  • rt Evaluate patches using domain shader
  • rt apply displacement map

tangent patches (fix up surface normals), extraordinary vertices (valence < 4 or > 4)

Available in March 2009 DX SDK (today), next release in June 09

SubD11 sample

Optimization will be performed when hardware is finalized

Current shader design is not expected to perform well on hardware.

Out of Order: Making In-Order Processors Play Nicely (presented by Allan Murphy)

VMX on the 360 for optimization of vector math

Slower than C counterpart, and out of order (so broken)

Missing out of order logic

  • no instruction reordering
  • no store forward hardware
  • smaller caches, slower memory
  • no l3 cache

 

  • LHS
  • L2 Miss
  • Expensive, non pipelined instructions
  • Branch mispredict penalty

Load Hit Store

  • Store to memory location, then load, flush the L2 cache
  • Casts, changing register set, aliasing
  • Passing by value, or by reference
  • On PC: instruction reordering and store-forwarding hardware

L2 Miss

  • Loading from location, checks cache
  • Cost ~610 cycles to load cache line
  • Hot cold split
  • Reduce in-memory data size
  • Use cache coherent structures
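
A sketch of the hot/cold split idea with some back-of-the-envelope arithmetic. The entity layout is made up, the 128-byte cache line comes from the talk, and the line count assumes the array starts on a cache line boundary and that `float` is 4 bytes.

```cpp
#include <cstddef>

// Before: every entity drags rarely used data through the cache.
struct EntityFat {
    float position[3];
    float velocity[3];
    char  debugName[64];      // cold: only read by tools
    char  scriptState[128];   // cold: read on rare events
};

// After: hot data is packed tightly, cold data lives in a parallel array.
struct EntityHot  { float position[3]; float velocity[3]; };
struct EntityCold { char debugName[64]; char scriptState[128]; };

constexpr std::size_t kCacheLine = 128;

// Cache lines touched by a linear pass over `count` items of `stride` bytes.
constexpr std::size_t linesTouched(std::size_t count, std::size_t stride) {
    return (count * stride + kCacheLine - 1) / kCacheLine;
}
```

For 1000 entities, the update loop touches 1688 lines with the fat layout and 188 with the hot-only layout, so far fewer L2 misses are possible in the first place.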

Expensive Instructions

  • non pipelined instructions
  • Stalls hardware threads

Branch Mispredict

  • Mispredicting branch
  • 23-24 cycle delay
  • Know how the compiler implements branches
  • Reduce total branch count for task
  • Refactor calculations to remove branches
  • Unroll

Profiling!!!

360 Tools

  • PIX Cpu instruction trace
  • LibPMCPB counters
  • XbPerfView sampling capture

Other Platforms

  • SN Tuner, vTune

Think laterally

  • Inline functions
  • pass and return in register (__declspec(passinreg))
  • __restrict (compiler released from being ultra careful)
  • const

 

Compiler options

  • Inline
  • Prefer speed over size
  • Fast floating point over precise
  • 360 (/Ou removing div by zero, /Oc runs a second code scheduling pass)
  • Reduce parameter counts
  • Prefer 32, 64, 128 bit parameters
  • Isolate constants
  • Avoid virtual if feasible

Know your cache architecture

  • Cross core sharing policy (L2 shared, L1 single)
  • Prefetch mech (dcbt, dcbz128)
  • L2 1 MB, L1 32 KB
  • Cache line 128 byte

Know your instruction set

  • 360 specific (VMX, slow instructions, fsel, vsel, vcmp*, vrlimi)
  • PS3 (altivec)
  • PC (SSE2-4.1 and friends)

What went wrong

  • Correctness
  • Guessed at 1 perf issue
  • SIMD vs straight float
  • Memory access and L2 usage unchanged
  • Branch behavior exactly the same

Image Analysis

  • Gaussian Mixture Model
  • Profiling showed (86% time in pixel cost function)